Yusheng Xiang

**Band 97**

Y. Xiang

AI and IoT Meet Mobile Machines: Towards a Smart Working Site

#### **Karlsruher Schriftenreihe Fahrzeugsystemtechnik Band 97**

Editors

#### **FAST Institut für Fahrzeugsystemtechnik**

Prof. Dr. rer. nat. Frank Gauterin, Prof. Dr.-Ing. Marcus Geimer, Prof. Dr.-Ing. Peter Gratzfeld, Prof. Dr.-Ing. Frank Henning

The Institut für Fahrzeugsystemtechnik (Institute of Vehicle System Technology) comprises the divisions Rail System Technology, Vehicle Technology, Lightweight Technology, and Mobile Machines.

An overview of all volumes published in this series to date can be found at the end of the book.

# **AI and IoT Meet Mobile Machines: Towards a Smart Working Site**

by Yusheng Xiang

Karlsruher Institut für Technologie Institut für Fahrzeugsystemtechnik

AI and IoT Meet Mobile Machines: Towards a Smart Working Site

Dissertation approved by the KIT Department of Mechanical Engineering of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doctor of Engineering (Doktor der Ingenieurwissenschaften)

by Yusheng Xiang, M.Sc.

Date of oral examination: 22 September 2021

First reviewer: Prof. Dr.-Ing. Marcus Geimer

Second reviewer: Prof. Dr.-Ing. Tamim Asfour

#### **Imprint**

Karlsruher Institut für Technologie (KIT) KIT Scientific Publishing Straße am Forum 2 D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding parts marked otherwise, the cover, pictures and graphs – is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2022 – Printed on FSC-certified paper

ISSN 1869-6058 ISBN 978-3-7315-1165-6 DOI 10.5445/KSP/1000143073

# **Editor's Preface**

Methods of artificial intelligence (AI) are increasingly moving into the public focus. In principle, they are suited to controlling, or influencing in an expected manner, systems that cannot be controlled with classical control methods, or only with very high effort. Machine learning (ML) methods and artificial neural networks (ANN) are among the most frequently researched AI methods today.

At the same time, the megatrend of digitalization can be observed: almost every product a customer uses has an electronic control unit, can be connected to the internet, or can draw information from the internet. There is hardly a place in the world from which one cannot reach the Internet of Things (IoT).

The Karlsruher Schriftenreihe Fahrzeugsystemtechnik is dedicated to topics of the control and digitalization of vehicles. For the vehicle categories of passenger cars, commercial vehicles, mobile machines, and rail vehicles, the series presents research that examines vehicle technology on four levels: the vehicle as a complex mechatronic system, driver-vehicle interaction, the vehicle in traffic and infrastructure, and the vehicle in society and the environment.

However, AI and IoT methods also offer great potential in the logistics of large construction sites. The planning of vehicle routes and of the individual work activities are examples of the optimization shown in this Volume 97. Mr. Xiang chooses the large construction site in Wuhan for the Huoshenshan hospital as the example motivating his work. He first extends Conflict-Based Search (CBS) so that, with a two-level structure (restricting constraints and optimal path planning), he can plan optimal paths for multiple machines within a very short computation time. He then shows how a multi-layer approach can be used to build a map in real time, which mobile machines can use for path planning. To identify the sub-cycles of a work task, he develops a convolutional, recurrent, deep neural network. With a significance of over 95%, he can use it to recognize the cycle segments of a wheel loader. He also develops an extensive dataset for the detection of mobile machines. For this he uses the YOLOv3 algorithm and achieves detection rates well above 80%. Last but not least, he compares machine-to-machine communication based on IEEE 802.11 (WLAN) and 5G. Although the 5G network understandably achieves considerably higher data rates in the scenarios Mr. Xiang investigated, he also shows its drawbacks at vehicle speeds above 40 km/h.

Karlsruhe, September 2021

Prof. Dr.-Ing. Marcus Geimer

# **Preface**

This dissertation includes parts of my research during the last three years as a PhD student, research scientist, and visiting scholar at Karlsruhe Institute of Technology, Robert Bosch LLC, and the University of California, Berkeley. Here I would like to sincerely thank the people who contributed to my dissertation and my growth.

First of all, I am very grateful for the guidance of Prof. Dr. Marcus Geimer at KIT. During the time when we taught lectures at the university, Prof. Geimer showed me through his actions what it means to be responsible and rigorous. I also appreciate the time we spent generating novel ideas and writing academic papers together. It has been my pleasure to cooperate with such a great educator. His virtues will influence me for my whole life.

I would also like to thank Dr. Steffen Mutschler at Bosch Rexroth, who trusted me and offered me the opportunity to do my research at a prestigious university and company. During the early phase of my research, he guided me thoroughly and patiently so that I became familiar with the projects quickly. Without a doubt, he made an outstanding contribution to my personal and professional growth in the early stage of my PhD. None of my "achievements" during my PhD could have been accomplished without him.

Another important person is Dr. Christine Brach at Bosch Rexroth. After Dr. Steffen Mutschler left our department, she took on the responsibility of guiding me and spent plenty of time on me, even though she must have been extremely busy as a department leader. Her support greatly accelerated my projects and research. Her tolerance for my unconventional viewpoints also safeguarded the innovation of my research. Her openness, wisdom, and encouragement broadened my horizons and gave me the freedom to consider problems at a higher level.

Prof. Dr. Samuel S. Mao at UC Berkeley made me realize the difference between incremental innovation and disruptive innovation. Inspired by the conversations with him, I gained a deeper insight into Silicon Valley's charm and a comprehensive understanding of the idea of making human life better. As an outstanding professor with more than 46,000 citations and a pioneer in transferring research results into industry through high-tech startups, he guided me to keep pushing beyond my limits. Outstanding engineering research convinces both professors and venture capital partners.

I sincerely thank Dipl.-Ing. Norman Brix for his valuable expertise and impressive know-how in both software and hardware. What's more, he is my best German language teacher.

In addition, my special thanks go to Robert Bosch LLC, where I gained knowledge and happiness at the same time over the past four years, starting with an internship. I am also grateful to my graduate and undergraduate students, who supported me with great enthusiasm and aptitude, and to the Karlsruhe House of Young Scientists for providing the scholarship for my visit abroad.

Finally, I would like to thank my partner, M.Sc. Tianqing Su. Her excellent programming skills and in-depth knowledge of computer science helped me overcome many technical difficulties. She also always encouraged and accompanied me in the time before deadlines and appointments, making these hard times colorful and beautiful. Moreover, I appreciate her patience in listening to the content of my "boring" dissertation almost every day.

Thank you!

Yusheng Xiang on Mar. 01, 2021

# **Abstract**

Infrastructure construction is a cornerstone of society and a catalyst of the economy. Therefore, improving the efficiency of mobile machinery and reducing its cost of use have enormous economic benefits in the vast and growing construction market. For this purpose, industry and academia have proposed many methods during the past few decades that contribute to ever better products. As research in this area matures, significant further optimization of a single construction machine becomes less likely. Therefore, instead of focusing on improving the performance of a single construction machine, I consider a group of construction machines as a whole system to improve the productivity of the working site. In this thesis, I envision a novel concept, the smart working site, which increases productivity through fleet management from multiple aspects using Artificial Intelligence (AI) and the Internet of Things (IoT).

An investigation of the famous construction site of the Huoshenshan hospital in Wuhan, where the project was finished at an unprecedented speed during the coronavirus outbreak in 2020, shows that its most distinguishing features were the large investment in machines and the well-ordered coordination. Inspired by this particular working site, this thesis presents approaches to substitute some human coordinators with AI and IoT and thus bring the concept of a smart working site offering high productivity closer to reality.

First, I introduce a novel multi working-machines pathfinding algorithm to resolve path conflicts among machines and prioritize the more critical machines. The proposed algorithm outperforms the State-of-the-Art (SOTA) solution in pathfinding time while still achieving an optimal solution. To navigate along the optimal path from the start point to the destination, an accurate localization and mapping algorithm is indispensable. Therefore, I developed a multi GPS/IMU Simultaneous Localization And Mapping (SLAM) system based on commodity sensors that accounts for the dynamically changing working site and for sensor noise. This SLAM system provides the location and terrain information needed for successful path planning. Since some difficult tasks may still be carried out by human drivers in the next decade, I endow my AI system with the capability to predict the motion of manned machines. Concretely, I introduce combined neural networks to detect the working process of manned machines and validate them with experimental data recorded on a wheel loader. Because the selected combined neural network is more suitable for transfer learning than the SOTA Multivariable Time Series Classification algorithm, my deep learning model generalizes better across different working sites and is more robust to the diversity of construction machines. Then, I created a visual monitoring system for the safety of participants without localization equipment. Given that operating machines in a closed site can be treated as a Level 4 (L4) automated driving task, I built a mobile machines dataset to serve as a base dataset for training the SOTA deep-learning-based visual algorithm. By taking full advantage of L4 features, I show that my approach is highly effective.
To share all the information mentioned above between the command center and the construction machines, I evaluated two major wireless communication systems for working sites, i.e., WLAN-based IEEE 802.11p and the 5G cellular network, to achieve seamless sharing of the large volume of information. The research on 5G indicates how the working site should be set up, and the research on ad-hoc networks presents the handover strategy.

This thesis contributes to making Wuhan's speed the normal speed of the future by quantitatively evaluating feasibility in light of cutting-edge AI and IoT technologies.

Keywords: Smart Working Site, Multi Working Machine Pathfinding Algorithm, Multivariable Time Series Classification Algorithm, Mobile Machines Dataset, SLAM, 5G, IEEE 802.11p

# **Kurzfassung (German Abstract)**

The construction of infrastructure is a cornerstone of society and a catalyst of the economy. Improving the efficiency of mobile machines and reducing their cost of use therefore bring enormous economic benefits in the vast and growing construction market. To this end, industry and academia have proposed many methods in recent decades that contribute to ever better and more capable products. As research in this area matures, significant optimization of a single construction machine becomes less likely. Instead of concentrating on improving the performance of individual construction machines, I considered a group of construction machines as an overall system in order to improve the productivity of the construction site. In this thesis, I present a novel concept, the Smart Working Site, to increase productivity through fleet management from various perspectives and with artificial intelligence (AI) and the Internet of Things (IoT).

In examining the famous construction site of the Huoshenshan hospital in Wuhan, which was completed at unprecedented speed during the coronavirus outbreak in 2020, we noticed the large investment in machines and the orderly coordination. Inspired by this particular site, the aim of this thesis is to present approaches that bring an "intelligent construction site" closer to reality, replacing some human workers with AI and the IoT and increasing productivity.

First, I introduced a novel algorithm for the route planning of multiple working machines in order to resolve path collisions between machines and to prioritize the more critical machines. The proposed algorithm outperforms the SOTA (State-of-the-Art) solution in computation time while achieving an optimal solution. To navigate the optimal route from the start point to the destination, an accurate localization and mapping algorithm is indispensable. For this reason, I developed a SLAM (Simultaneous Localization And Mapping) system based on commodity sensors, because the working site changes dynamically and sensor noise occurs. This SLAM system provides localization and terrain information to support successful route planning. Since some difficult tasks may still be carried out by humans in the next ten years, I cannot do without them even on the intelligent construction site. Concretely, I introduced combined neural networks to capture the working process of manned machines and validated them with experimental data recorded with a wheel loader. Because the selected combined neural network is better suited to transfer learning than the SOTA solution for multivariable time series classification, our deep learning model generalizes better across different working sites and is more robust to the diversity of construction machines. Subsequently, I created a visual monitoring system for the safety of participants without localization equipment. Since machines on a closed site can be treated as Level 4 (L4) automated driving, I created a dataset of mobile machines that can be used as a base dataset for training the SOTA deep-learning-based visual algorithm.
By making full use of the L4 features, I demonstrated that our approach is extremely effective. To exchange all of the above information between the command center and the construction machines, I evaluated two important wireless communication systems for working sites, namely WLAN-based IEEE 802.11p and the 5G cellular network, in order to achieve seamless exchange of the large volume of information. The research on 5G indicates the setup of the working site, and the research on ad-hoc networks presents the handover strategy.

Through the quantitative evaluation of feasibility with cutting-edge AIoT technologies, this thesis contributes to making the speed of the construction projects in Wuhan the normal practice in the future.

Keywords: Smart Working Site, Multi Working Machine Pathfinding Algorithm, Multivariable Time Series Classification Algorithm, Mobile Machines Dataset, SLAM, 5G, IEEE 802.11p

# **Acronyms and symbols**

#### **Acronyms**


| Acronym | Meaning |
|---|---|
| CRDNN | A combination of CNN, RNN, and DNN |
| CSMA/CA | Carrier Sense Multiple Access with Collision Avoidance |





## **Symbols**

#### **Path Planning for Machines Fleet Management**


#### **SLAM for Machines on a Smart Working Site**




#### **Motion Prediction of Manned Working Machines**



#### **Wireless Communication System**

| Symbol | Description |
|---|---|
| d | Distance between transmitter and receiver |
| f<sup>c</sup> | Correction factor for the IEEE 802.11p analytical estimation method |
| G<sup>r</sup> | Receive antenna gain |


# **1 Introduction**

Recent progress in Artificial Intelligence (AI) and the Internet of Things (IoT) leads me to believe that traditional construction and mining sites can be extended with an autonomous system, or at least an assistance system, in order to increase productivity and safety and to reduce project costs.

In this thesis, I focus on the fleet management logistics problem with regard to productivity and safety on smart working sites. Currently, because effective cooperation among individual machines is lacking, heavy machinery wastes a lot of time waiting for other machines, and conflicts occur. As a result, the number of construction machines a working site can accommodate is limited, and overall productivity is unsatisfactory. The basic idea is to resolve the movement conflicts inside the working site so that more machines can be deployed to perform tasks simultaneously, which significantly improves productivity. A persuasive example of the benefit of introducing a logistics solution into the working site is the construction site of the famous Huoshenshan hospital, built in Wuhan during the coronavirus outbreak in 2020. By deploying an extraordinary number of working machines and human cooperators and manually coordinating the machines to avoid conflicts among them, the construction project was finished at an unprecedented speed. However, running such a construction site can be quite expensive because of the salaries of experienced workers. Moreover, since the logistics problem is Non-deterministic Polynomial-time hard (NP-hard), computer algorithms can better pursue a series of optimization objectives, such as shorter travel distance and real-time performance. In light of that, I try to use AI to replace human decisions in the working site and use IoT technology to share information among the participants seamlessly. Compared with proposals for an individual machine, which usually increase performance by at most 50%, fleet management solutions show the potential to improve the productivity of a working site several times over.

The thesis presents a series of approaches contributing to the machine cooperation strategy, machine motion prediction, visual site monitoring, and wireless site communication. Then, by combining these cornerstone technologies systematically, I present a blueprint of the future working site that benefits from AI and IoT technologies.

## **1.1 Problem and Goal Statement**

The concept of a smart working site requires a system solution and thus cannot be realized with a single approach. To bring it closer to reality, serious technical difficulties must be overcome. In particular, I focus on the following research questions to achieve AI-based fleet management:


I try to answer the aforementioned questions with cutting-edge AI and IoT technologies, which have shown capabilities surpassing human performance and have thus become the technological wave of the second decade of the 21st century. Since the machines inside a large working site can be diverse in function, a fully autonomous system for all machines in the working site is still very challenging; thus, I endow my AI and IoT system with the ability to cooperate with human beings.

While individual ideas for increasing productivity and safety while reducing project costs have been intensively studied from both the management and the technical point of view, the improvement that an isolated technology brings to the entire working site is limited. I conjecture that this is because a working site is a complicated system and thus needs concerted efforts. In this thesis, I first advance the current SOTA technologies for each of the aforementioned research questions and then find an appropriate configuration to show the benefit of the smart working site as a whole system.

## **1.2 Applications**

I highlight two critical scenarios in which the proposed smart working site concept should be adopted.

In mining and construction sites, there is high demand for highly automated and coordinated fleet management from both the economic and the safety point of view. Furthermore, since the machines usually operate in a closed area at low speeds and untrained pedestrians are already kept out of the site, it is easier and safer to automate their driving. Unmanned construction and mining sites bring enormous benefits in terms of safety, productivity, and labor:

• Safety Benefit: Since 1900, over 100,000 coal mine accidents have taken place in the US [1]. In China, the number of deaths in coal mining accidents exceeded 2,000 each year from 1993 to 2010, with a peak of approximately 7,000 deaths in 2002 [2]. Coal miners are highly exposed to coal dust and toxic gas, which results in higher rates of ischemic heart disease and coal workers' pneumoconiosis [3]. These safety problems can be largely eliminated by unmanned construction and mining sites.


## **1.3 Contributions**

The contributions of this thesis are as follows:


while outperforming the SOTA solution [5] in terms of generalization ability, thanks to its faster transfer learning capability.


## **1.4 Thesis Outline**

This thesis is structured as follows: Chapter 2 reviews current contributions and developments on the smart working site. Chapter 3 presents the multi working-machine pathfinding algorithm. Chapter 4 details the SLAM technologies used to acquire the map information and the machines' locations in the working site. Chapter 5 then shows working process recognition through multivariate time series algorithms to predict machines' motion. Chapter 6 describes the visual monitoring system. Finally, Chapter 7 covers the wireless communication system of the working site. Conclusions and an outlook are given in Chapter 8.

# **2 Background Knowledge of Smart Working Site**

This chapter discusses previous contributions to smart working sites with respect to the existing literature. I begin with a brief overview of the novel technologies and ideas for the smart working site here and go deeper at the beginning of each of the following chapters.

## **2.1 Concepts and Consensus**

The smart working site is a novel integrated management and automation model for working sites that tightly integrates AI, IoT, and the traditional construction industry. It takes full advantage of emerging information technologies such as mobile internet, Artificial Intelligence of Things (AIoT), cloud computing, and big data, and focuses on key factors such as people, machines, materials, methods, and the environment to fundamentally change the interaction and working mode on a working site. My research also serves the establishment of the smart working site.

In general, construction project management manages construction components, such as workers, materials, and construction machines, with the aim of achieving construction objectives quickly and well. Managing a construction site means making a series of decisions across construction processes using the available information and knowledge [6]. The main objective of information management is to support decision-making by ensuring that accurate information is always available at the right time, in the right format, to the right person [7]. In recent years, there have been many studies on how to provide decision-makers with precise, timely, and well-organized information, e.g., by exploiting Building Information Modeling (BIM) [8, 9] or by adopting Information and Communication Technologies (ICT) such as Auto-ID [10, 11, 12, 13] and sensing technology [14, 15].

Figure 2.1: An example of a smart construction site illustrated by Komatsu.

However, due to the complexity and diversity of construction projects and the increasing demand for high-quality engineering, decisions made by human beings are considered increasingly unreliable as the volume of information grows explosively, especially compared with decisions made by or with the help of advanced AI technology. The problems often manifest as product quality defects, time overruns, and budget overruns caused by insufficient information, cognitive capacity, and time. These problems afflicted engineers using traditional management techniques before the AIoT era.

Fortunately, thanks to the tremendous progress and profound influence of AI and sensor technology, Construction Engineering and Management (CEM) is experiencing a rapid digital transformation. With more tools available, both academia and industry have proposed novel concepts to realize refined management methods and automation in the construction industry. The development of smart construction sites is inseparable from the advancement of information technology. It is noteworthy that most researchers adopt the BIM system as the backbone on which to build their contributions. Nowadays, the smart construction site is an integration of comprehensive intelligent systems. The interaction of physical space and cyberspace allows the advantages of smart construction sites to be fully realized. From an overall perspective, the management and automation of smart working sites determine a construction project's quality. In addition, compared with traditional construction environments, the smart construction site allows construction quality to be fully supervised. Communication between staff becomes easier and more straightforward, which saves working time, improves work efficiency, and at the same time ensures the quality of construction projects. Besides that, smart construction sites have become an indispensable component of safe production. Various monitoring devices installed at the construction site, together with a more comprehensive intelligent monitoring and prevention system, can better compensate for the omissions of traditional management work.

The characteristics of CEM can be summarized as uniqueness, labor intensity, high dynamics, complexity, and uncertainty. The literature [16, 17, 18] indicates the following basic requirements and consensus for future working sites. First, problems should be prevented before they actually occur. For instance, the digital twin concept depicts a cyber-physical system in which the digital model supplies simulation results to the physical model, from which inspection data is collected. Analogously, cloud VR/AR solutions enable more interaction between the cyber and physical worlds. Second, information should be shared through AIoT and blockchain technology to make the process more transparent. Consequently, the BIM model can be updated promptly whenever the physical model changes. Moreover, the process is expected to be supervised using smart robotics, e.g., unmanned aerial vehicles. Last but not least, the AI system should present the most important information concisely to human decision-makers using data mining tools, or in some cases substitute human decisions directly with decisions from advanced AI. For example, some applications use Natural Language Processing (NLP) to retrieve vital information from reports for the human decision-maker.

## **2.2 Current Applications and Challenges**

Besides research on improving equipment systems, including increased engine efficiency, reduced greenhouse gas emissions, and improved electro-hydraulic control, information technology has attracted more and more attention over the last two decades. In particular, the triumvirate of IoT, AI, and cloud technologies offers new opportunities for developing applications on smart construction sites [17], as shown in Fig. 2.2.

Figure 2.2: The critical topics for smart working site summarized by Štefanič [17] and us.

For example, in 2006, Lundeen developed a marker-based pose estimation system for excavators in order to determine the three-dimensional position and orientation of the trencher [19]. Later, in 2011, Turkan described a novel system that integrates 3D object recognition technology with schedule information into a combined 4D<sup>1</sup> object recognition system focused on progress tracking [20]. Tested on a comprehensive field database acquired during the construction of the structure of the "Engineering V Building" at the University of Waterloo, this system demonstrates a degree of accuracy for automated structural progress tracking and schedule updating that meets or surpasses manual performance. In 2016, Ren and Wu developed a real-time automated anti-collision system that can warn crane operators about potential collisions and then automatically implement collision-avoidance strategies [21]. One advantage of this system is that it does not require additional devices and can be installed in existing crane controllers. In the same year, Liu proposed another novel system that combines inclinometers, laser ranging sensors, and wireless communication technologies to monitor lift thickness during highway construction [22], and Braun presented a concept for the automated comparison of the actual and planned states of construction for the early detection of deviations in the construction process [23]. In this concept, the actual state of the construction site is captured by photogrammetry. Concretely, dense point clouds are generated by fusing disparity maps created with Semi-Global Matching (SGM). These are then matched against the target state provided by a 4D Building Information Model.
Also in 2016, Kim proposed an automated hazardous area identification model to improve the efficiency of safety management, based on the deviation between the optimal route determined by extracting nodes from BIM and the actual route of a laborer collected from the Real-Time Location System (RTLS) [24].

As we can see, the contributions about the smart working site demonstrate many benefits and are proved to be promising approaches to improve the current daily working site. However, they are not preferable for commercial solutions until 2020, i.e., most working sites in reality are still quite traditional. I conjecture that it may because that these contributions do not improve productivity several times. Most of the current information technologies and corresponding management in

<sup>1</sup> Besides the x, y, z coordinates denoted by the first three dimensions, the 4D model describes the schedule information using the fourth dimension.

Figure 2.3: Selected works about smart working site [14, 25, 20, 23, 22, 16, 24, 26, 27].

the earthmoving industry address real-time tracking and productivity estimation of the equipment rather than productivity improvement. As a result, some end customers, e.g., construction contractors, regard these technologies as incomplete solutions. For instance, a productivity estimation function may be difficult for some engineers to accept if it is not accompanied by corresponding productivity-increasing methods. In fact, the McKinsey Global Institute (MGI) analysis found that the construction industry was among the least digitized industries in the total economy, and that its annual productivity growth over the past 20 years was only a third of the total economy average<sup>2</sup>. Hence, improving productivity is an urgent and important issue for the future.

Notably, the world-famous Huoshenshan construction project in Wuhan demonstrated that construction productivity<sup>3</sup> can be boosted significantly by investing in a large number of machines. In light of that, in this thesis I explore how to increase the number of machines on a working site and how to solve the resulting cooperation and safety problems among machines and workers using AI and IoT technologies.

<sup>2</sup> McKinsey: The next normal in construction (2020)

<sup>3</sup> To avoid misunderstanding, productivity is defined in this thesis as the efficiency of executing construction projects. Although there is no comprehensive analysis of how much faster the Wuhan construction site was, its ultra-rapid execution is its best-known feature [28, 29].

# **3 Path Planning for Machines Fleet Management<sup>1</sup>**

A pathfinding solution for multiple working machines enables more mobile machines to work simultaneously inside a working site, so that productivity can be expected to increase substantially. To date, potential cooperation conflicts among construction machines limit how many machines can be deployed on a given working site. To solve the cooperation problem, civil engineers optimize the working site from a logistics perspective, while computer scientists improve the performance of pathfinding algorithms on given benchmark maps. For practical implementation on a construction site, it is sensible to solve the problem with a hybrid solution; therefore, in my study, I propose an algorithm based on a cutting-edge multi-agent pathfinding algorithm that enables a large number of machines to cooperate and, at the same time, offers advice on modifying unreasonable parts of the working site. Using logistics information from BIM, such as unloading and loading points, I add a pathfinding solution for multiple machines to improve the productivity of the whole construction fleet. In the previous studies described in Section 3.2, the experiments were limited to no more than ten participants, and the computation time needed to obtain the solution was not given; thus, I publish my pseudo-code and my test maps and benchmark my results. The most notable feature of my algorithm is that it can quickly replan paths to cope with emergencies on a construction site.

<sup>1</sup> All the figures, text, and results of the presented work in this chapter have been published in our publication [30]. My contribution to the paper is summarized as 100% in terms of conception and methodology, 90% of literature review, 90% of code, 60% of results visualization, and 95% of formulation.

## **3.1 Introduction**

Although much has been achieved in construction machines with respect to productivity and safety, the pursuit of even higher productivity and better safety never stops. In the past decades, civil engineers and software engineers working for the construction industry have introduced BIM [31] as a powerful tool to increase productivity and safety performance while reducing project cost by means of digital technology. In general, BIM provides a model of the construction project in three or more dimensions and even the installation sequence of the components in order to avoid mistakes during the construction stage. As BIM has matured, this software and process have been adopted for many large and especially complicated construction projects worldwide [32], since mistakes during the real construction process cause much more severe consequences than those in virtual engineering. Also, BIM is considered a lifelong software, contributing not only to the early construction phase [31, 33] but also to the time after the construction project is finished [34]. Despite the upfront cost of training the corresponding staff and building the model in a computer, there is a consensus that BIM is a necessary tool, at least for large construction projects. However, although current BIM software defines the start and end points between which material should be transported, concrete paths guiding the trucks to accomplish this goal are not given. More generally, an algorithm that determines the paths of the participants on the working site so that they can move to their destinations without collision and hesitation is still under development.

As shown in Fig. 3.1, one motivation to combine these path planning algorithms with BIM is to determine the construction machines' travel paths so that the machines can be expected to move faster and more densely without hesitation. To accomplish higher productivity and better safety through path planning, some researchers calculate and display efficient paths for a given construction site [35, 36], whereas others optimize the construction site layout with regard to the productivity of the paths [37, 38]. However, most of the current path planning solutions for construction sites focus on individual units, i.e., the interaction and path conflict of the different machines or other agents

Figure 3.1: Overview of the smart construction site concept. I propose using artificial intelligence to unify the scheduling of construction machinery on construction sites. The goal is to maximize the use of construction machinery on a specific construction site by avoiding conflicts among the machines and thus ultimately to improve the productivity of the construction site.

are ignored. Consequently, the working sites' spatial and time utilization is limited. Inspired by warehouse logistics solutions [39], where many robots work at the same time under the commands of path planning algorithms and thus achieve considerably higher transport efficiency as a fleet, I envision that the Multi-Agent Pathfinding (MAPF) solution can bring a comparable step forward to the construction industry. A persuasive instance of the benefit of introducing a MAPF solution into a working site is the construction site of the famous Huoshenshan hospital in Wuhan during the coronavirus 2019 outbreak. With an extraordinary number of working machines and human cooperators invested, the construction project was finished at an unprecedented speed, although conflicts among machines were avoided only through manual coordination. Evidently, running such a construction site can be quite expensive because it relies on experienced workers. Also, since the MAPF problem is NP-hard, computer algorithms can be expected to outperform manual coordination with respect to a series of optimization objectives, such as shorter moving distance and realtime performance. In light of that, I propose a MAPF solution for working machines as an extension of the BIM system so that more machines can work simultaneously and thereby achieve better overall productivity.

The aim of this chapter is to extend current BIM software with AI path-planning algorithms so that more machines can work simultaneously, since the paths are calculated to avoid collisions. Because the start points and endpoints can be given directly in the BIM system, I devote myself to the implementation level, i.e., how exactly the machines should reach the goals given by the BIM system. I envision that, by utilizing AI and IoT [40, 41] technology, the case in Wuhan will become a normal case in the future.

The main contributions of this chapter can be summed up in the following points:


The rest of this chapter is organized as follows. Section 3.2 briefly introduces the prerequisites and background knowledge in the fields of BIM and of path planning methods for construction machines and robotics that are needed to follow this chapter. Next, the existing problems are illustrated in Section 3.3. After that, in Section 3.4, I describe the setup of my MAPF approach, including the map that gives the start and goal positions, the low level for individual pathfinding, and the high level for conflict resolution. Then, I describe the experimental setup. Following that, Section 3.6 shows my approach's performance on maps based on real construction sites. Finally, Section 3.7 summarizes the advantages of my approach, and Section 3.8 gives conclusions and an outlook.

## **3.2 Related Works**

### **3.2.1 Building Information Modelling (BIM)**

BIM is a 3D model-based information management process in the field of Architecture, Engineering, and Construction (AEC) that facilitates efficient design and construction processes and inter-organizational collaboration [42]. Many BIM-based software packages exist, such as Autodesk Revit Building (Revit), ArchiCAD, Bentley, and SolidWorks [43]. By building the whole project virtually before physical construction begins, the construction sequencing is determined, including material ordering, fabrication, and delivery schedules for all building components. Therefore, conflict, interference, and collision are avoided at an early stage, contributing to improved site efficiency and reduced cost [44]. As the functions inside BIM have grown, such as scheduling, virtual reality [45], and logistics management [42], it has been extended to a model with four or more dimensions. Nowadays, research on BIM is flourishing. Combining AI and IoT with the BIM system is considered the next potential boom for BIM systems. Survey papers on this topic can be found in [46, 47].

In order to automate the whole construction site and compute the optimal path for the heavy machines, logistic information is vital, which demonstrates which materials should be placed in which location at which time in the right quantity. Logistics management in construction involves the strategic storage, handling, transportation and distribution of resources, as well as planning of a building site's layout [48]. Whitlock has proposed a desktop approach to adopt BIM for construction logistics management [42]. Such logistic information, for instance, unloading points, on-site arrangements-logistics layouts, which are generated at the outset of the pre-construction process, can be used as the input data for the path planning.

#### **3.2.2 Path Planning for Construction Machines**

On a construction site, there are usually multiple machines working simultaneously in a defined area with given assignments. Therefore, coordinated construction logistics can increase productivity, decrease material usage, and help protect workers' health and safety [49]. To date, many researchers have proposed solutions for the logistical problem inside construction sites at diverse levels, i.e., path planning and motion control. In this section, I summarize the previous research on moving paths inside construction sites.

In the construction industry, the earth-moving sector is among the pioneers in adopting new sensing and information technologies [16], such as bulldozer [50, 51], and grading machines [52]. Given two points A and B on a construction site, the objective is to determine the shortest path from A to B maintaining a safe distance from obstacles. The approach proposed by Kim is a path-planning method for a mobile construction robot to find a continuous collision-free path from the initial position of the construction robot to its target position by improved Bug-based algorithm [53]. The algorithm can work with the disturbance of static and dynamic objects. Obviously, the performance of the approach is based on the accuracy of the sensors. At that time, the methods to acquire site information were still immature. Hence, the spatial model supporting path planning in a partially known and partially unknown environment was brought forward by Lee. Accordingly, the spatial model provides the domain for finding an efficient path on a construction site through the use of an algorithm that combines a shortest path algorithm and a dynamic path-planning algorithm. This approach differs from existing path-planning approaches that assume the construction site is totally known or totally unknown. Thus, problems associated with managing a changing construction environment and ignorance of designated roadway networks are overcome [54]. In the same year, Soltani has compared the performance of different methods for the path planning inside construction sites [55], such as Dijkstra [56], A<sup>∗</sup> [57], and Genetic Algorithm (GA) [58], by evaluating comprehensive multi objects, e.g., site layout representation, distance formulation, hazard zone modeling, and visibility calculations. Also, the author points out the use of Closed-Circuit TeleVision (CCTV) cameras should be considered to enhance site security [55]. 
Although the simulation results show that the GA has the best performance and the other two algorithms have quite similar results, I conjecture that it might be ascribed to their maps, which are quite easy and do not include difficult obstacles such as bottlenecks. Fairly recently, Song tried to integrate some path planning algorithms into BIM system [35]. The basic idea is to determine the path to transport the materials at the very beginning phase, i.e., the construction site design phase. Also, the study verified the demand for the introduction of path planning by survey questions. The shortcoming of this approach is that the interaction of other participants inside construction sites during the construction project was not taken into account.

So far, the aforementioned studies focus on a path for one machine inside the construction site. In contrast, the following research shows solutions for multiple machines working simultaneously on a construction site. An influential study on path planning on the construction site was published by Cheng in 2012 [59]. The objective of the paper is to provide the n best and safest paths between two points in a work area while maintaining a safe distance from identified obstacles. The approach proposed in the paper uses the Dijkstra algorithm and solves the paths of the different participants in sequence. Due to the limited recognition distance of ultra-wideband sensors, this approach might not be suitable for huge working sites. Also, since the paths of the agents are calculated one by one, the computational efficiency is not ideal from the point of view of 2020. Four years after Cheng's study, the research by Bohacs showed the difficulties of path planning for a construction site. They use A<sup>∗</sup> as a basis and develop an algorithm that lets a limited number of machines cooperate without collision [60] within a small map. Concretely, they showed a demo with 3 machines in a 10 × 10 grid map. As the flow chart in their paper shows, the algorithm depends on conditional statements. As a result, it cannot make full use of the available computing power.

With the development of information technology, Štefanič provides an overview of emerging smart construction applications in areas such as construction monitoring, construction site management, safety at work, early disaster warning, and resources and assets management [17]. Also, Tumer describes the future construction site utilizing Industry 4.0 [61]. Without a doubt, a future working site should fully benefit from AI [62, 63, 64], automation technology [65], Simultaneous Localization and Mapping (SLAM) [66], and IoT [67, 40, 68, 41]. Therefore, the degree of uncertainty of construction sites is decreasing, and thus I consider the construction sites as known in my research.

Based on this literature review, A<sup>∗</sup> is the most developed algorithm used so far to solve path planning tasks on construction sites. It combines the step-cost calculation of the Dijkstra algorithm with a heuristic estimate of the remaining cost to the goal. However, as the number of machines increases, naive A<sup>∗</sup> might not be suitable to solve the MAPF problem due to its algorithmic complexity. Thus, it is necessary to explore the state-of-the-art (SOTA) solutions in the field of mobile robotics in order to find a more appropriate solution.

### **3.2.3 MAPF for Mobile Robotics**

For a single agent, the planning task can be described as finding the lowest-cost path from the starting point to the target point. Heuristic search, e.g., the A<sup>∗</sup> algorithm, solves such a problem efficiently. However, the naive A<sup>∗</sup> algorithm assumes that all obstacles are static, which is an idealized assumption in the pathfinding problem [69]. In contrast, to solve the MAPF problem, the other participants in the map must be considered, i.e., the dynamic obstacles also affect the optimal path of an individual agent.

MAPF for mobile robotics is both a well-studied and a dynamically developing topic. The usage scenarios of MAPF and their corresponding algorithms are diverse, such as warehouses [39], computer games [70], and autonomous driving at intersections [71]. As of the end of 2020, although reinforcement learning is attracting more and more attention, the influential research on the shortest path and MAPF is mainly based on graph theory. To date, there is no universal solution for all kinds of pathfinding problems; thus, algorithms with different time and space complexities have been proposed [72].

Based on the survey paper by Felner [73], it is known that no algorithm dominates all others in all circumstances. There is a tradeoff between high-quality path solutions and realtime performance. The mainstream MAPF solutions can be classified into search-based solvers and rule-based solvers. The former intend to find the best or a near-optimal solution, whereas the latter run much faster but produce solutions that are far from optimal. Some compromise solutions combine the two ideas, namely, hybrid approaches. Another type of solution, namely reduction-based optimal solvers, focuses on reducing MAPF problems to problems with well-known solutions, such as the Constraint Satisfaction Problem (CSP). Since this approach usually only aims at makespan tasks, I will not go deeper into it. To date, the most influential solutions for MAPF are Conflict Based Search (CBS) and its variants due to their wide use in real-world applications.

Sharon proposes Conflict Based Search (CBS), combining the advantages of coupled and decoupled approaches. Although the pathfinding process consists strictly of single-agent searches, it is guaranteed to return optimal results, except for the variants that deliberately provide a suboptimal solution for the purpose of realtime performance. As the authors describe, CBS adopts a two-level structure, where the high-level search operates on a Constraint Tree (CT) whose nodes hold the constraints. The lower level then finds the concrete path for each agent individually using the information from the higher level. The brilliance of this design is that the search process is no longer exponential in the number of participants but exponential in the number of conflicts encountered during the pathfinding process.
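The two-level structure described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: it assumes a small unweighted grid (0 = free, 1 = blocked), unit step costs, and only vertex and swapping conflicts. The names `low_level`, `first_conflict`, and `cbs` are hypothetical and chosen for illustration only.

```python
import heapq
from itertools import count

MOVES = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))  # wait plus 4 neighbours


def low_level(grid, start, goal, constraints):
    """Space-time A* for one agent (hypothetical helper).

    constraints holds vertex bans (cell, t) and edge bans (prev, cell, t).
    """
    rows, cols = len(grid), len(grid[0])
    # The agent may only stop once no later ban on its goal cell remains.
    last_goal_ban = max((c[-1] for c in constraints
                         if len(c) == 2 and c[0] == goal), default=-1)
    horizon = rows * cols + max((c[-1] for c in constraints), default=0) + 1

    def h(cell):  # Manhattan distance, admissible for unit step costs
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    heap = [(h(start), 0, start, [start])]
    seen = set()
    while heap:
        _, t, cell, path = heapq.heappop(heap)
        if cell == goal and t > last_goal_ban:
            return path
        if (cell, t) in seen or t >= horizon:
            continue
        seen.add((cell, t))
        for dr, dc in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]]:
                continue
            if (nxt, t + 1) in constraints or (cell, nxt, t + 1) in constraints:
                continue
            heapq.heappush(heap, (t + 1 + h(nxt), t + 1, nxt, path + [nxt]))
    return None


def first_conflict(paths):
    """Return the two constraint options for the earliest conflict, or None."""
    def at(p, t):  # agents stay at their goal after arriving
        return p[min(t, len(p) - 1)]
    for t in range(max(len(p) for p in paths)):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                a, b = paths[i], paths[j]
                if at(a, t) == at(b, t):  # vertex conflict
                    return [(i, (at(a, t), t)), (j, (at(b, t), t))]
                if at(a, t) == at(b, t + 1) and at(a, t + 1) == at(b, t):  # swap
                    return [(i, (at(a, t), at(a, t + 1), t + 1)),
                            (j, (at(b, t), at(b, t + 1), t + 1))]
    return None


def cbs(grid, starts, goals):
    """High level: best-first search over constraint-tree nodes by sum-of-costs."""
    tie = count()
    cons = [frozenset() for _ in starts]
    paths = [low_level(grid, s, g, c) for s, g, c in zip(starts, goals, cons)]
    if any(p is None for p in paths):
        return None
    heap = [(sum(len(p) - 1 for p in paths), next(tie), cons, paths)]
    while heap:
        _, _, cons, paths = heapq.heappop(heap)
        conflict = first_conflict(paths)
        if conflict is None:
            return paths
        for agent, constraint in conflict:  # split into two child nodes
            child_cons = list(cons)
            child_cons[agent] = cons[agent] | {constraint}
            new_path = low_level(grid, starts[agent], goals[agent], child_cons[agent])
            if new_path is None:
                continue
            child_paths = list(paths)
            child_paths[agent] = new_path
            cost = sum(len(p) - 1 for p in child_paths)
            heapq.heappush(heap, (cost, next(tie), child_cons, child_paths))
    return None


grid = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
paths = cbs(grid, starts=[(1, 0), (1, 2)], goals=[(1, 2), (1, 0)])
print(sum(len(p) - 1 for p in paths))
```

In the usage example, two agents must exchange positions along the middle row of an open 3 × 3 grid. Each child node in the sketch adds exactly one constraint for one agent, so the tree grows with the number of conflicts rather than with the joint state space. For unsolvable instances this simple version may not terminate, which a practical implementation would have to guard against.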

Since CBS tries to find the optimal solution and therefore has a relatively long runtime, Boyarski summarizes two methods to reduce the runtime in the improved CBS algorithm [74]. Concretely, it first adopts Meta-Agent CBS (MA-CBS) [4], which merges several agents and handles them as one large agent. Moreover, it uses the bypass improvement, which encourages one of the agents to find an alternative path instead of performing a split at the high level. As mentioned by Boyarski, the bypass concept avoids generating unnecessary new nodes in the CT [74]. Since the bypass concept only looks for a replacement among paths with the same cost as the one to be replaced, optimality is not harmed. For the high-level search, Felner suggests adding heuristics to CBS so that the conflicts are not chosen arbitrarily [75]. After that, Li found improved heuristics to guide the high-level search [76]. CBS with disjoint splitting was also introduced [77]. Its main contribution is the novel notion of positive constraints, which force agent a<sub>i</sub> to be at vertex v at timestep t. In this fashion, CBS with disjoint splitting reduces unnecessary expansion of the CT. Further improvements exist, such as Lazy CBS, which avoids resolving the same conflicts between the same pairs of agents many times, a behavior that CBS shows owing to the lack of connection among subproblems [78]. Hönig proposed an approach called Conflict-Based Search with Optimal Task Assignment (CBS-TA) [79]. The improvement is mainly because it creates the forest on demand. Solving MAPF optimally is proven to be NP-hard, so CBS and all other optimal solvers do not scale up. Alternatively, Barer proposed a suboptimal variant of the CBS algorithm [80] so that the problem can be solved suboptimally but much faster.

To sum up, naive CBS is an optimal pathfinding solution that is based on graph theory. The time consumption of the algorithm mainly depends on the conflicts occurring among the agents since they increase the number of nodes in the high-level tree. Its performance in bottlenecks and corridors is better, whereas its performance in open space can be worse than that of enhanced A<sup>∗</sup>. Thus, the variants of CBS focus on reducing the number of nodes in the CT and thereby give CBS a higher success rate in general. Note that in this chapter, I use the same terminology as Stern's research to avoid misunderstanding; however, some mathematical descriptions might be adjusted.

### **3.3 Problem Statement**

Although the path planning problem has attracted engineers' attention, the proposed methods are difficult to apply to a large construction site with many machines. Theoretically, given a 4-neighborhood movement model and an undirected graph G(V, E), the branching factor can be estimated as (E/V)<sup>k</sup> and the search space as V<sup>k</sup> if paths for k machines have to be planned. For 20 agents, considering a normal working site of 500 m × 100 m for which 50 × 10 cells are needed, the search space grows to (50 × 10)<sup>20</sup> = 9.54 × 10<sup>53</sup>, which cannot be solved within an acceptable duration for a real application even if a top CPU of 2020 achieving almost 200 GFLOPs is used. Thus, using the traditional methods to solve the cooperation task is still challenging. Also, some algorithms are based on replanning the path if agents encounter each other head-to-head. However, these methods require excellent perception capability and limit the movement velocity of the agents.
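The size estimate above can be checked with a few lines of arithmetic. The sketch assumes, very optimistically, that one joint state is evaluated per floating-point operation; the variable names are illustrative only.

```python
# Joint search space for k agents on a grid with |V| cells: |V| ** k.
cells = 50 * 10          # 500 m x 100 m site at 10 m per cell
agents = 20
joint_states = cells ** agents
print(f"{joint_states:.2e}")  # 9.54e+53

# Assumption: one joint state per floating-point operation at 200 GFLOPs.
ops_per_second = 200e9
seconds = joint_states / ops_per_second
years = seconds / (365 * 24 * 3600)
print(f"{years:.1e} years")
```

Even under this generous assumption, exhaustively enumerating the joint space would take on the order of 10<sup>35</sup> years, which is why the search has to be decomposed rather than performed in the joint state space.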

On the other hand, CBS-based solutions treat all objects equally. Concretely, each robot has the same capability, and every cell on the ground is assumed to be equally difficult to traverse. However, this does not hold true on a working site; some machines should be assigned a higher priority, and some paths are much easier to pass. For instance, the velocity of larger machines is usually more difficult to control. Also, stopping on a slope is much more dangerous than stopping on flat ground. Hence, in my study, I further develop the original CBS so that it can deal with the many priority problems on a real working site. Also, the computation time should be much shorter in order to handle emergencies.

### **3.4 Model Building**

Nowadays, with the development of SLAM regarding visual recognition, IoT, and satellite technology, the degree of uncertainty of construction sites is decreasing. Thus, I consider the construction sites as known in this chapter. In most instances, a MAPF solution can be evaluated in two ways. The first one is sum-of-costs

Figure 3.2: A grid map with terrain weights based on a real construction site, drawn by Liu [81]. In this map, the green, orange, and red cells denote terrain that is easy, normal, and difficult to pass, respectively. The blue cells denote places that are considered impossible to pass, for example because they are occupied by obstacles. On a real construction site, such obstacles can be places where construction materials are stored temporarily.

which describes the accumulated cost of all the agents. Such costs can be time consumption, fuel consumption, or some other objective. The other way to analyze the performance of MAPF solvers is makespan, i.e., the time at which the last agent reaches its goal. Unlike the robots in warehouses, working machines spend a relatively long period doing their duty after they have arrived. Thus, makespan is less vital than sum-of-costs in the field of construction or milling machines. Consequently, I adopt sum-of-costs as the evaluation criterion rather than makespan.
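The difference between the two criteria can be made concrete with a small sketch. It assumes that each path is a list of cells with one entry per time step and that every step has unit cost; weighted terrain would replace the step count by the accumulated cell weights. The function names are illustrative.

```python
def sum_of_costs(paths):
    """Accumulated cost of all agents, assuming unit cost per time step."""
    return sum(len(path) - 1 for path in paths)


def makespan(paths):
    """Time step at which the last agent reaches its goal."""
    return max(len(path) - 1 for path in paths)


example_paths = [
    [(0, 0), (0, 1), (0, 2)],                   # agent 1 arrives at t = 2
    [(2, 0), (2, 1), (2, 2), (1, 2), (0, 2)],   # agent 2 arrives at t = 4
    [(1, 0), (1, 1)],                           # agent 3 arrives at t = 1
]
print(sum_of_costs(example_paths), makespan(example_paths))
```

For these hypothetical paths, the sum-of-costs is 7 while the makespan is 4; shortening the paths of the first and third agents would improve the former without changing the latter.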

The working-site MAPF problem can be described as follows: given are a graph G(V, E) and a set of k agents labeled a<sub>1</sub>, ..., a<sub>k</sub>. Each agent a<sub>i</sub> has a start position s<sub>i</sub> ∈ V and a goal position g<sub>i</sub> ∈ V. Based on the practical conditions on a working site, I consider vertex conflicts, edge conflicts, and swapping conflicts as unacceptable conflicts, whereas following and cycle conflicts are allowed in my study. Formally, the unacceptable conflicts are described as,

$$\begin{cases} \pi\_i[t] = \pi\_j[t], \\ \pi\_i[t] = \pi\_j[t+1] \land \pi\_i[t+1] = \pi\_j[t] \end{cases} \tag{3.1}$$

where π<sub>i</sub> and π<sub>j</sub> are the single-agent paths of a<sub>i</sub> and a<sub>j</sub>, and π<sub>i</sub>[t] is the location of a<sub>i</sub> at time step t. The first equation shows the vertex conflict whilst the second equation describes the swapping conflict. An edge conflict always entails a vertex conflict; thus, no additional equation is needed. Intuitively, π<sub>1</sub>[2] denotes the location of the first agent at the second time step.

#### **3.4.1 Multi-Layer Grid Map**

Unlike a standard graph problem which adds the weight on the edges, I add the weight directly on the grids. This is mainly for three reasons. First and foremost, the machines usually occupy a relatively large area and thus should not be modeled as a simple vertex and ignore their geometry. Also, most previous studies in the field of construction machines used the grid-based map. To guarantee compatibility, I tend to use a similar solution since no approach from them obviously outperforms the other one. Last but not least, if weights are applied on edges, it becomes challenging to penalize the waiting process.

Inspired by the research of Fankhauser [82], a map can include many layers to store different types of data. The map information should be saved in the BIM system so that the path planning process can be carried out. In a previous study [66], Xiang developed a realtime map plotter of the construction site according to the ground condition, offering multi-layer grid-based maps, which divide the environment into uniform cells. Moreover, maps can also be acquired by Lidar or by cameras that provide depth information, installed on a drone or on the ground.

Fig. 3.3 illustrates the multi-layered grid map concept, where the data of each cell are stored on congruent layers. In many construction projects, since resistance and grade of the road are the most important information for the construction

Figure 3.3: An example of multilayered grid maps. My approach depends on multilayered grid maps to offer data of different types of information to make the best path. Concretely, every grid saves a 1\*3 matrix, including location information and the corresponding terrain information. The map, which can be visualized in the BIM system, is saved as a 2\*m\*n\*3 tensor, where m is the max displacement in the x-direction, while n is the max displacement in the y-direction. In case a place is unknown, it will be marked as NaN to denote the uncertainty of the regions and be treated as obstacles.

Figure 3.4: Detail description of a layer in the grid-based map.

machines, I show a two-layer grid map as an example. Concretely, in my study, the map is divided into small cells with a resolution of 10 meters per cell to cover the geometry of the vehicles. In practice, GPS/IMU-based Kalman filter algorithms can be used to locate the mobile machine on the construction site. To describe the ground condition of construction sites, I use the value of each cell to represent the ground situation. In Fig. 3.4, a layer that holds the data of a grid-based map is shown. Although I only demonstrate the map with two layers, it is relatively easy to add a third layer in case more information should be taken into account for the path planning because I can simply add the weights together.
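The idea of adding the layer weights together can be sketched as follows. The sketch assumes that each layer is a nested list of per-cell values with identical shape, that unknown cells are marked as NaN and treated as obstacles, and that one scalar weight per layer is given; the name `combine_layers` and the example values are hypothetical.

```python
import math


def combine_layers(layers, layer_weights):
    """Weighted cell-wise sum of all layers; NaN in any layer marks an obstacle."""
    rows, cols = len(layers[0]), len(layers[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = [layer[r][c] for layer in layers]
            if any(math.isnan(v) for v in values):
                combined[r][c] = math.inf  # unknown region: impassable
            else:
                combined[r][c] = sum(w * v for w, v in zip(layer_weights, values))
    return combined


# Two illustrative layers: rolling resistance and road grade.
resistance = [[1.0, 2.0], [math.nan, 1.0]]
grade = [[0.0, 3.0], [1.0, 1.0]]
cost_map = combine_layers([resistance, grade], layer_weights=(1.0, 0.5))
print(cost_map)
```

Adding a third layer then only requires appending it to the list together with its weight; the rest of the planner can keep working on the single combined cost map.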

#### **3.4.2 Lower Level Search**

The concept of CBS does not restrict the choice of the lower-level search algorithm. In addition, since negative weights cannot occur in 3D space, I believe that Dijkstra and A<sup>∗</sup> can work well for the individual shortest path search. In light of that, I use best-first search<sup>2</sup>. Also, to accelerate my algorithm, I limit the moving directions of an agent to 4 and thus reduce the branching factor to 5, including waiting, instead of 9 or more. To obtain the optimal solution, I set the heuristic smaller than the real distance since weights are considered. Because a grid map is used, I use the Manhattan distance as the basis of the heuristic to guide the search.
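A minimal sketch of these two design choices is given below. It assumes that entering or waiting on a cell costs the weight of that cell and that impassable cells are stored as `None`. Scaling the Manhattan distance by the cheapest cell weight is one possible way to keep the heuristic below the real cost; it is an assumption for illustration, not necessarily the scaling used in this thesis.

```python
MOVES = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))  # wait plus 4 neighbours


def successors(cell, weights):
    """Yield (next_cell, step_cost) for the at most 5 actions of an agent."""
    rows, cols = len(weights), len(weights[0])
    for dr, dc in MOVES:
        r, c = cell[0] + dr, cell[1] + dc
        if 0 <= r < rows and 0 <= c < cols and weights[r][c] is not None:
            yield (r, c), weights[r][c]


def heuristic(cell, goal, weights):
    """Manhattan distance scaled by the cheapest cell weight (never overestimates)."""
    cheapest = min(w for row in weights for w in row if w is not None)
    return cheapest * (abs(cell[0] - goal[0]) + abs(cell[1] - goal[1]))


weights = [[1, 2, 1],
           [1, None, 3],
           [1, 1, 1]]
print(list(successors((0, 0), weights)))
print(heuristic((0, 0), (2, 2), weights))
```

Because every remaining step costs at least the cheapest cell weight, the scaled Manhattan distance cannot exceed the true remaining cost, which is the property needed for an optimal best-first search.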

At the lower level, the algorithm searches for the best path of an individual agent based on the estimated cost of the path through the current vertex to the goal, formally,

$$f\_{f,r}(n) = \sum\_{i=1}^{p} W\_{L,i} \cdot g\_{i,f,r}(n) + h\_{f,r}(n) \tag{3.2}$$

where f(n) is the estimated cost from the source through the current vertex n to the goal, g<sub>i</sub> denotes the real cost from the source to the current vertex n considering the ith weight-grid map, and h denotes the estimated cost from the current vertex

<sup>2</sup> The concept that searches the most likely region first.

Figure 3.5: The basic requirements of the path planning algorithm. As shown in the left subfigure, the vehicle should take the lowest-cost path to reach its goal. The middle subfigure shows that the vehicle with lower priority should wait until the vehicle with higher priority pass through if there is no other bypass possibility. Last but not least, the right subfigure indicates that a good path should not be too close to dangerous objects.

n to the predefined goal based on Manhattan distance. W<sub>L,i</sub> is the weight of the ith layer. The indices f and r indicate whether the cost is estimated for the forward or the backward search.

Planning the best path for machines to reach their goals is a multi-objective task. Generally speaking, the evaluation criteria can be divided into subjective and objective criteria. The objective criteria capture measurable properties of the planned path, especially the terrain, which can affect safety and efficiency. Some roads inside the construction site may be paved with asphalt and are therefore considered to offer better road conditions than roads made of sand. Consequently, the cost of passing different routes is different. Besides that, the road slope should be taken into account since waiting on a steep hill is much more dangerous than staying on flat ground. Therefore, I introduce multiple layers to record the individual characteristics of the terrain and plan the best path based on them. Concretely, a construction site map is divided into a series of cells and layers, according to different criteria. The weights of the individual cells in one layer are saved as shown in the following matrix.

$$
\tilde{W}\_{m,n} = \begin{pmatrix} w\_{1,1} & w\_{1,2} & \cdots & w\_{1,n} \\ w\_{2,1} & w\_{2,2} & \cdots & w\_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w\_{m,1} & w\_{m,2} & \cdots & w\_{m,n} \end{pmatrix}
$$
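As a minimal sketch of how such layers might be combined, the following Python snippet computes the weighted cost of entering a cell from several weight-grid layers, in the spirit of the sum over layers in Eq. 3.2. The layer names, the example weights, and the functions `cell_cost` and `path_cost` are illustrative assumptions, not values or code from this work.

```python
# Hypothetical layers for a 2 x 3 map; every number here is illustrative.
road_condition = [[1.0, 1.0, 3.0],
                  [1.0, 2.0, 3.0]]   # e.g. asphalt = 1, sand = 3 (assumed)
road_slope = [[1.0, 1.0, 1.0],
              [2.0, 2.0, 1.0]]       # e.g. flat = 1, steep = 2 (assumed)

layers = [road_condition, road_slope]
layer_weights = [1.0, 0.5]           # assumed layer weights W_L


def cell_cost(row, col):
    """Weighted sum over all layers of the cost of entering one cell."""
    return sum(w * layer[row][col] for w, layer in zip(layer_weights, layers))


def path_cost(cells):
    """Accumulated real cost g along a sequence of (row, col) cells.

    The start cell is excluded because the machine already occupies it.
    """
    return sum(cell_cost(r, c) for r, c in cells[1:])
```

Keeping each criterion in its own layer means that a single weight can be retuned without touching the map data itself.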

In contrast, the subjective criteria may not harm the actual performance of the whole system; however, they have an impact on people's psychology. For instance, a crane or other dangerous objects on a working site, such as a power station, should be protected, and mobile machines should not approach them unnecessarily. Even if a machine is not involved in an accident, getting close to a dangerous object is stressful for site managers and indicates a potential risk. To address this problem, I add g<sup>3</sup> to penalize the machines for occupying the areas surrounding these special objects, see Eq. 3.3,

$$g\_3(n) = \sum\_{o=1}^{r} \frac{C\_o}{|(X\_n - X\_o)| + |(Y\_n - Y\_o)|}\tag{3.3}$$

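where [X<sub>o</sub>, Y<sub>o</sub>] is the position of an object that the machines should avoid approaching, [X<sub>n</sub>, Y<sub>n</sub>] is the position of the current vertex n, and C<sup>o</sup> denotes the intensity of the penalty.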
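The penalty of Eq. 3.3 can be sketched in a few lines of Python. The function name `proximity_penalty` and the handling of a zero distance are assumptions made for illustration; Eq. 3.3 itself does not define the value at the object's own cell.

```python
def proximity_penalty(cell, dangerous_objects):
    """Sum of C_o / Manhattan distance over all dangerous objects (Eq. 3.3).

    cell: (x, y) of the current vertex n.
    dangerous_objects: list of ((x_o, y_o), c_o) pairs.
    """
    x_n, y_n = cell
    penalty = 0.0
    for (x_o, y_o), c_o in dangerous_objects:
        distance = abs(x_n - x_o) + abs(y_n - y_o)
        if distance == 0:
            # Assumption: the object's own cell is treated as impassable.
            return float("inf")
        penalty += c_o / distance
    return penalty
```

For example, with one object of intensity 4 two cells away and another of intensity 2 eight cells away, the penalty is 4/2 + 2/8 = 2.25.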

#### Algorithm 1 Bidirectional A<sup>∗</sup> Algorithm at Low Level to Speed up the Searching Process

Input: G(v, t), s̃, g̃ from predefined map information in YAML, originally obtained from visual or LiDAR recognition

Output: Path, d<sub>shortest</sub>

Initialisation:


```
LOOP Process
while OpenSet[0], OpenSet[1] not empty do
    v_c ← ExtractMin(dist) if forward, otherwise ExtractMin(distR)
    OpenSet[0].remove(u)
    if neighbor(u) = valid then
        if u not in ClosedSet[0] then
            g[0][u]tmp ← g[0][v] + StepCost
        end if
        if neighbor not in OpenSet[0] then
            OpenSet[0].append(u)
        else if g[0][u]tmp > g[0][u] then
            continue
        end if
        CF[0][u] ← v
        pi_f ← |vec(d(v) − d(s))|, pi_r ← |vec(d(t) − d(v))|
        h_f ← (pi_f − pi_r)/2
        h_r ← −h_f
        f[0][u] ← g[0][u] + h_f if forward, otherwise h_r
    end if
    if u in ClosedSet[0] then
        break
    end if
    ClosedSet[0].append(u)
    repeat symmetrically for v_R as for v, where OpenSet[1] and ClosedSet[1] should be used
end while
distance ← ∞, u_best ← None
for u in ClosedSet[0] + ClosedSet[1] do
    if dist[u] + distR[u] < distance then
        u_best ← u
        distance ← dist[u] + distR[u]
    end if
end for
```

Figure 3.6: Illustration of the benefit of bidirectional search with heuristics. Squares and triangles represent the start point and the goal point, respectively. The grey region denotes the whole region of the map, and the green region shows the region that the algorithm must search before finding the shortest path for the agents. Apparently, bidirectional search with heuristics cannot be slower than its predecessors.

To accelerate the low-level search, I use the bidirectional A<sup>∗</sup> algorithm with path return for the initial search. Algorithm 1 shows the steps of the algorithm, where CF is the dictionary that stores the path sequence. Since bidirectional A<sup>∗</sup> is not suitable for handling the waiting operation, a neighbor is considered valid if its grid cell is not occupied by an obstacle, i.e., conflicts with other agents are ignored at this stage. The index 0 in the algorithm denotes the forward search, whereas the index 1 denotes the backward search.
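For readers who prefer runnable code, the following is a minimal Python sketch of a bidirectional A<sup>∗</sup> search on a weighted, 4-connected grid. It is not the implementation used in this work: the grid encoding (cost of entering a cell, `None` for obstacles, all costs at least 1), the function name `bidirectional_astar`, and the stopping rule (stop once the sum of the smallest keys of both open sets reaches the best meeting cost found so far) are assumptions. It uses the averaged potentials h_f = (π_f − π_r)/2 and h_r = −h_f with Manhattan distances, and ignores other agents, as described above.

```python
import heapq


def bidirectional_astar(grid, start, goal):
    """Shortest path on a weighted 4-connected grid.

    grid[y][x] is the cost (>= 1) of entering a cell, or None for an obstacle.
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    if start == goal:
        return [start], 0
    rows, cols = len(grid), len(grid[0])

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def potential(v, d):
        # Averaged potential: h_f = (pi_f - pi_r) / 2 and h_r = -h_f.
        h_f = (manhattan(v, goal) - manhattan(v, start)) / 2
        return h_f if d == 0 else -h_f

    def neighbors(v):
        x, y = v
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < cols and 0 <= ny < rows and grid[ny][nx] is not None:
                yield (nx, ny)

    # Index 0 is the forward search, index 1 the backward search.
    g = [{start: 0}, {goal: 0}]
    came_from = [{}, {}]
    closed = [set(), set()]
    open_set = [[(potential(start, 0), start)], [(potential(goal, 1), goal)]]
    best, meet = float("inf"), None

    while open_set[0] and open_set[1]:
        # Assumed stopping rule: no better meeting point can exist any more.
        if open_set[0][0][0] + open_set[1][0][0] >= best:
            break
        d = 0 if open_set[0][0][0] <= open_set[1][0][0] else 1
        _, v = heapq.heappop(open_set[d])
        if v in closed[d]:
            continue  # outdated heap entry
        closed[d].add(v)
        for u in neighbors(v):
            # Forward: pay for entering u. Backward: pay for entering v.
            step = grid[u[1]][u[0]] if d == 0 else grid[v[1]][v[0]]
            tmp = g[d][v] + step
            if tmp < g[d].get(u, float("inf")):
                g[d][u] = tmp
                came_from[d][u] = v
                heapq.heappush(open_set[d], (tmp + potential(u, d), u))
            if u in g[1 - d] and g[d][u] + g[1 - d][u] < best:
                best, meet = g[d][u] + g[1 - d][u], u

    if meet is None:
        return None, float("inf")
    path, v = [meet], meet
    while v in came_from[0]:      # walk back to the start
        v = came_from[0][v]
        path.append(v)
    path.reverse()
    v = meet
    while v in came_from[1]:      # walk forward to the goal
        v = came_from[1][v]
        path.append(v)
    return path, best
```

On the hypothetical grid `[[1, 5, 1], [1, 1, 1]]` with start `(0, 0)` and goal `(2, 0)`, the sketch avoids the expensive middle cell and returns the detour through the second row.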

After the initial solution is proposed, I adopt unidirectional A<sup>∗</sup> to resolve the conflicts among the agents. The priority of the agents is also taken into account here. When a conflict occurs, the algorithm decides which agents should give way to agents with higher priority. Unlike other research that uses hard priorities, which can make the algorithm incomplete, I adopt a soft priority, namely a penalization function, so that the algorithm is guaranteed to find a feasible solution if the MAPF problem is solvable. Since the A<sup>∗</sup> algorithm is well known, I only give the part that involves the priority of the agents, see Eq. 3.4,

$$g[u]\_{tmp} = g[v] + P\_{a\_i} \cdot \sum\_{k=1}^{L} StepCost \tag{3.4}$$

where u is the neighboring vertex and v is the current vertex. L is the total number of layers of the map, and P<sub>a<sub>i</sub></sub> is the priority value of agent i.
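Eq. 3.4 translates directly into a one-line update. The sketch below is only illustrative: the function name `tentative_cost` and the example numbers are assumptions, and the text does not state how priority values map to the importance of an agent, so the values used here are arbitrary.

```python
def tentative_cost(g_current, layer_step_costs, priority_value):
    """Eq. 3.4: g[u]_tmp = g[v] + P_a_i * (sum of the step costs of all layers)."""
    return g_current + priority_value * sum(layer_step_costs)


# The same move on a two-layer map, evaluated for two hypothetical agents.
cost_agent_a = tentative_cost(10.0, [1.0, 2.0], priority_value=1.0)
cost_agent_b = tentative_cost(10.0, [1.0, 2.0], priority_value=1.5)
```

Because the priority only scales the step cost, no move is ever forbidden outright, which is the property that distinguishes the soft priority from a hard one.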

#### **3.4.3 Higher Level Search**

Although the papers on CBS and its variants report the success rate of the algorithms in various specific scenarios, the experiments are performed with a time limit of at least 1 minute. Considering that some machines might not maintain their speed and that emergencies may happen, I believe a feasible algorithm for the construction site should give a command to all participants within 5 seconds, even if the command is simply to wait.


Concretely, I let some machines have priority to move to their goals, while the others wait for a while or move to an intermediate destination when the task is too complicated for a realtime response. If necessary, the algorithm suggests reducing the total number of machines on site. Different from the original CBS algorithm, I add four different strategies in case the algorithm cannot find a solution for all agents within 5 seconds. The basic idea of these acceleration strategies is to reduce the complexity of the task by solving the problem step by step, as described in the following.


The individual paths are compared in the high-level search to find the conflicts among the agents. To enhance the realtime performance, I use the bidirectional A<sup>∗</sup> algorithm with the heuristics proposed by [83] to create the initial path of each agent, and afterward use unidirectional A<sup>∗</sup> to update the individual path of each agent, since only unidirectional A<sup>∗</sup> can deal with the waiting process. In the BIM system, the algorithm saves the positions of the conflicts and the corresponding agents during the search. As mentioned, if the algorithm cannot solve the planning problem due to too many conflicts, it removes some agents and then replans the paths of the remaining agents. In this fashion, I ensure that the known MAPF problems can be solved in a timely manner. If an emergency occurs and the algorithm cannot solve the new task in time, I can easily and quickly locate the troublemaker. The optimization goal here is not only to ensure a short calculation time, but also to let as many machines as possible move at the same time.
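The comparison of individual paths at the high level can be illustrated with a short Python sketch that detects vertex conflicts (two agents in the same cell at the same time step) and edge conflicts (two agents swapping cells between consecutive time steps). The data layout, the tuple format of a conflict, and the assumption that an agent waits at its goal after arriving are illustrative choices, not details taken from this work.

```python
from itertools import combinations


def find_conflicts(paths):
    """Return all vertex and edge conflicts between the given paths.

    paths: dict mapping an agent id to a non-empty list of cells, one per
    time step. Assumption: an agent stays at its last cell after arriving.
    """
    horizon = max(len(p) for p in paths.values())

    def at(path, t):
        return path[min(t, len(path) - 1)]

    conflicts = []
    for a, b in combinations(sorted(paths), 2):
        for t in range(horizon):
            pos_a, pos_b = at(paths[a], t), at(paths[b], t)
            if pos_a == pos_b:
                conflicts.append(("vertex", a, b, pos_a, t))
            elif (t + 1 < horizon
                  and pos_a == at(paths[b], t + 1)
                  and at(paths[a], t + 1) == pos_b):
                # The two agents swap cells between t and t + 1.
                conflicts.append(("edge", a, b, (pos_a, pos_b), t))
    return conflicts
```

Storing the agents together with the position and time of each conflict is what later allows the conflicts to be attributed to regions of the map and to individual agents.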

## **3.5 Experiment on Real Working Sites**

It is a consensus in graph theory research that, although a conclusion drawn on a specific map cannot guarantee effectiveness on another map, the more similar the maps, the more similar the results. To show the benefit of introducing MAPF to the mobile machines industry, I validated my algorithm on five typical real working sites. Concretely, they are a relatively open field with 20 or 50 agents, an open field with many obstacles, a two-sided working site connected by a bridge or narrow corridor, and a typical mining site. Since each map is also shown as the background of the path results, and to avoid repetition, I only demonstrate on the first and third maps how a map is processed to provide the prior information for successful pathfinding. The maps shown in Fig. 3.2 and Fig. 3.7 represent the same site at different times. Fig. 3.2 shows the earlier stage, while Fig. 3.7 shows a later stage of the construction process, in which more facilities are present. In my experiments, the dimensions of these maps are 20 × 13 and 17 × 12. For the sake of simplicity, I also assume that the velocity of all machines is constant; however, machines with quite different speeds can easily be handled by using the fastest speed as a reference and letting the slower agents occupy more than one grid cell at the same time, or vice versa.

The faster the project progresses, the faster the construction site changes, which indicates the difficulty of using pre-calculated path planning for a construction site. In this study, I divided the grid cells into different regions with respect to how easily the road can be passed, the slope of the road, and whether the place is safe.

Table 3.1: Weight table. The weights I use to describe the complicated terrain of construction sites


Figure 3.7: Map example drawn by Liu [81]. Another map based on a real construction site, which has more narrow corridors.

Here I report the solution finding time on an Intel Core i7-4720HQ CPU at 2.6 GHz. Because of its reasonable price at the end of 2020, it is suitable for large-scale commercial use. To reduce randomness, I repeated the experiments 50 times and report the average solution finding time as well as the average number of conflicts, which I use to analyze the conflicts and thus show the rationality of my optimization. Notice that I rounded the numbers to one decimal place where applicable.

Before I analyze the results of my experiments, I summarize the basic ideas of the algorithm I used. Similar to the original CBS algorithm, my MAPF algorithm adopts a two-level search, where the upper level finds the conflicts among the agents and the lower level searches for the best path of each individual agent. The lower level finds the paths first and sends the initial proposal to the upper level. Afterward, the upper level checks whether a planned path has one or more conflicts with the others. If there are no conflicts, the center commander, an AI system, accepts the preliminary proposal as the final solution of the MAPF task, and all agents are allowed to execute this solution. Otherwise, the upper level identifies the conflicts and sends this information as constraints to the lower level so that the conflicts are avoided. To generate the initial individual path for each machine faster, I use bidirectional search. If the upper level finds a conflict, the individual path is then updated with the unidirectional A<sup>∗</sup> algorithm. The algorithm always modifies the candidate solution with the lowest cost first, which guarantees that the final solution is optimal. In the experiments, I make no assumptions about what the participants are or about their working processes, in order to ensure the generalization of my method.

## **3.6 Experimental Results**

In this section, I show the planned paths for each map in Fig. 3.8. My algorithm successfully finds the optimal paths considering the priority of the machines, i.e., the paths with the lowest cost with respect to the main criterion, for all tested maps. The algorithm commands the machines to drive directly to the goal, to take a bypass, or simply to wait for others to pass through first.

In Tab. 3.3, I give the computational time needed to find the optimal solution. I divide the searching time into the initial search and the subsequent update process. In the first phase, the computational time is no more than 0.1 seconds on the tested maps. If the MAPF task is easy, i.e., interactions and potential conflicts among the agents are rare, the update process can also be completed very quickly. The total duration to obtain the optimal solution is within 0.2 seconds for the agents on map 1. However, in other cases, such as the MAPF tasks on map 2 and map 4, the total duration is quite different, although the time needed to offer the initial path proposal does not differ significantly. Concretely, the tasks on map 2 and map 4 need about 6.8 and 10.4 seconds to be solved, respectively. For such tasks, the solution can of course be found inside the BIM system before the machines execute the orders saved in the schedule file. In the ideal case, the computational time for finding the solution is not critical. However, in a real application, it is normal that participants do not act on time when something urgent happens; thus, the ability to replan the paths quickly is particularly important, rather than letting all machines wait in place. As shown in Fig. 3.11 and Fig. 3.12, the update process is the dominant part of the whole searching process. Also, comparing the durations of the initial search and the subsequent update process in Tab. 3.3, it is easy to conclude that shortening the update process is the main optimization direction to make my algorithm faster.

Figure 3.8: The planned paths for each map. In the clockwise direction, the subfigures show the final solutions, including the best path for each machine, for maps 1, 3, 4, and 5. The bottom left point is defined as the origin (0,0), and the horizontal axis is the first axis. The layout of map 2 is the same as that of map 1; the difference is that there are more agents on map 2. Due to the huge amount of information, I give the planned schedule for map 2 in Tab. 3.2 instead of using figures.

In this chapter, I demonstrate the optimization mechanism on the MAPF tasks on map 2 and map 4 for brevity; however, I confirm that my conclusions are also in line with the other maps I tested. The results on map 2 and map 4 are shown in Tab. 3.3. The optimization depends on the stage of the construction site. In the early stage, before the site is actually built, engineers have more freedom to optimize. To accelerate the computation, two methods are proposed for the early stage. The first idea is to modify the unreasonable parts of the construction site. In this fashion, the throughput of the construction site can be improved. Fig. 3.9 shows the positions where the algorithm commands the machines to pass through but encounters conflicts with other construction machines. As mentioned before, I consider two kinds of conflicts in this study since they are more in line with the construction site, namely edge conflicts and vertex conflicts. Fig. 3.9(a) and (c) show the conflicts found by the initial bidirectional search. Since the map is weighted, the best path is usually unique; this is partly supported by the fact that the number of conflicts found by the initial bidirectional search is a multiple of the number of times I ran the experiment. However, a unique best solution increases the possibility of generating conflicts.

Taking the MAPF task on map 2 as an example, Fig. 3.9 shows that three regions have more conflicts than others. Concretely, they are the region including vertices (16,8) and (17,8), the region including vertex (2,7), and the region including vertices (17,12) and (16,12). Correspondingly, based on the intended movement of the agents around these positions, the algorithm points out that the vertices (16,9), (17,9), (3,8), and (15,12) should be modified to have characteristics similar to their surroundings. For instance, the road condition of the vertices (16,9) and (17,9) should be changed from bad to good, since their surroundings are in good condition. The computational time is then dramatically reduced to about 0.31 seconds, which makes this modification appear to be the most effective method to reduce the solution finding time. The results are given in Tab. 3.3.

Table 3.2: The schedule of the MAPF task on map 2. In this task, the algorithm plans the paths for 50 agents.

Figure 3.9: The places that the agents intend to pass through and the resulting conflicts on the second map. As shown by the distribution of the points, the initial conflicts are closely related to the conflicts occurring during the conflict avoidance process. Notice that I ran the experiment 50 times and that the conflicts shown in these figures are the averages over these 50 experiments.

Figure 3.10: The places that the agents intend to pass through and the resulting conflicts on the fourth map. As shown by the distribution of the points, the initial conflicts are again closely related to the conflicts occurring during the conflict update process.

Figure 3.11: Statistics of the conflicts caused by the corresponding agents on map 2. The blue histogram denotes the conflicts found by the initial bidirectional search, and the orange histogram shows the conflicts solved while updating the solution with unidirectional search.

Figure 3.12: Statistics of the conflicts caused by the corresponding agents on map 4.

The other idea is to remove the biggest troublemaker in the MAPF task. The capacity of each construction site has a physical upper limit. No matter how excellent the algorithm is, too many participants will eventually lead to a decline in overall performance. Compared to the first method, whose potential drawback is that modifying the working site is not always feasible, removing a conflict-causing agent can be applied whenever needed. In the case of the MAPF task on map 2, agent 16 is removed according to Fig. 3.11, so that the computational duration drops to roughly 0.97 seconds. Since the overall productivity is only marginally affected, as 49 machines can still work well on the site, and the shorter duration gives the whole system the capability to deal with emergencies on site, this method is recommended if the construction site cannot be modified.
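Selecting the agent to remove amounts to counting how often each agent takes part in a conflict. The following Python sketch shows one possible way to do this; the function name, the input format, the tie-breaking rule, and the agent ids in the example are all assumptions for illustration and do not reproduce the data of Fig. 3.11.

```python
from collections import Counter


def most_conflicting_agent(conflicts):
    """Return the agent involved in the most conflicts, or None if there are none.

    conflicts: iterable of (agent_a, agent_b) pairs.
    """
    counts = Counter()
    for a, b in conflicts:
        counts[a] += 1
        counts[b] += 1
    if not counts:
        return None
    # Assumption: ties are broken by the smaller agent id for determinism.
    return min(counts, key=lambda agent: (-counts[agent], agent))


# Hypothetical conflict list: agent 1 takes part in three of four conflicts.
example = [(1, 2), (1, 3), (2, 4), (1, 4)]
candidate = most_conflicting_agent(example)
```

After the selected agent is removed, the remaining agents are replanned, and the procedure can be repeated if the solution time is still too long.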

In the BIM system, I deliberately made sure that the computational duration was within an acceptable period. However, this cannot guarantee that all potential conflicts for these MAPF tasks were removed. In the example of the MAPF task on map 2, when agent 16 did not keep up with the planned schedule and was delayed by two time steps, the duration for replanning the solution for the whole fleet rose to more than 5 seconds, even though the construction site had been modified in the early stage. Since I had already optimized the searching process in BIM so that the duration was shorter than 0.5 seconds, and the only thing that had changed was agent 16, I could quickly conclude that the increased computational time was due to agent 16. Afterward, the algorithm checked the paths of the other agents and found a new temporary destination for agent 16 to avoid conflicts. Concretely, the new goal for agent 16 is the vertex (15,7), and the computational time was then only 0.42 seconds. Notice that agent 16 is not allowed to stop at its original place, since it would block the only way for agents 17 and 14 to reach their goals. In case machine 16 totally loses its mobility, the algorithm asks every participant to stop and wait for human intervention. In contrast, if it is merely delayed due to external disturbances, it waits for the other agents to arrive and then continues to its original target.

In the MAPF task on map 4, I use the same methods to optimize the computation time in order to demonstrate the generalization capability of my solution when the terrain is more complex and there are fewer agents. As shown in Fig. 3.10, two regions have more conflicts than others, i.e., the region including vertices (8,6) and (9,6), and the region including vertex (6,10). With the same method as in the previous case, namely construction site optimization, the vertex (5,10) is identified, which should be changed from an obstacle to a road in good condition. In addition, the road condition of the vertices (9,5) and (10,6) should be changed from bad to good. After this optimization, the computation time is greatly reduced to roughly 0.17 seconds. For agent optimization, the most conspicuous troublemakers in this task, agent 10 and agent 11 as shown in Fig. 3.12, are the candidates for removal. Here I removed agent 10, so that the computation duration drops to about 0.17 seconds. Although the optimization results of these two methods in the MAPF task on map 4 are almost the same, I recommend layout optimization because only 12 agents were deployed on map 4. If an agent is removed, the overall productivity is reduced more significantly than in the previous scenario on map 2. When agent 10 did not follow the planned schedule perfectly and was delayed by one time step at the beginning, the computation duration for replanning could also exceed 5 seconds, even though the layout of the construction site had been optimized with the first method. The algorithm gave agent 10 a new goal, the vertex (7,4), and the computational time was afterward 0.15 seconds. According to these results, the algorithm is also effective for the task on map 4.

## **3.7 Advantages of My Methods**

In this study, my approach enables many machines to work simultaneously inside the working site. Firstly, the method helps civil engineers to arrange the construction site before the site is set up. My approach points out the positions where conflicts occur among the machines and thus indicates the places worth modifying. Moreover, my algorithm helps the engineers to determine a reasonable number of machines on a working site, on the premise of using advanced algorithms. In addition, my algorithm schedules a conflict-free solution for the agents so that they can move confidently without hesitation. Last but not least, since emergencies are inevitable, I designed the system to replan the path solution in a very short period compared to the SOTA MAPF algorithm, with only a slight loss of optimality of the solution.

## **3.8 Conclusion**

In this chapter, I presented an efficient and effective algorithm to calculate the paths of a fleet of machines on a construction site. Considering the complicated terrain of a construction site, I endowed my algorithm with the ability to handle weighted maps. Tested on five different and diverse maps, my method successfully found the best paths for a fleet whose participants have different importance. By solving the MAPF problem for a construction site from both the algorithmic and the construction layout perspectives, I showed the benefits of my hybrid method, especially in reducing the computational time to handle emergencies. Based on my results, modifying the unreasonable parts of the site is the most efficient way to speed up the searching process. Also, removing the agents that cause the most conflicts is always viable and can dramatically reduce the searching time, at the price of slightly reduced overall productivity.

# **4 SLAM for Machines on a Smart Working Site<sup>1</sup>**

The choice of a reasonable strategy for machines on a working site is determined not only by their intrinsic signals, but also very strongly by environmental information, especially the terrain. Because the construction site changes dynamically and a High Definition (HD) map is consequently absent, SLAM that provides terrain information for construction machines is still challenging. Current SLAM technologies proposed for mobile machines depend strongly on costly or computationally expensive sensors, such as RTK GPS and stereo cameras, so that commercial use is rare. In this chapter, I propose an affordable SLAM method to create a multi-layer grid map of the construction site so that the machines have the environmental information and can be optimized and directed accordingly. Concretely, after a machine passes by, I obtain the local information and record it. Combined with positioning technology, a map of the places of interest on the construction site can then be created. Based on simulation results obtained in Gazebo, I show that a suitable layout is the combination of one IMU and two differential GPS antennas using the unscented Kalman filter, which keeps the average distance error below 2 m and the mapping error below 1.3% in the harsh environment. The SLAM technology proposed in this chapter provides the cornerstone for the pathfinding solution proposed in the previous chapter.

<sup>1</sup> Except for minor modifications, all the figures, text, and results presented in this chapter have been published in my preprint publication [66]. My contribution to the paper is summarized as 100% in terms of conception and methodology, 90% of literature review, 50% of model building and simulation, 50% of results visualization, and 95% of formulation.

## **4.1 Introduction**

The environment also has an essential influence on the performance of a fleet of working machines, i.e., to perform tasks both efficiently and safely, the construction machines must be operated with knowledge of their location and surroundings. Thus, I propose a method that generates the map information surrounding the mobile machines with commodity sensors only, which makes it possible to improve the system further. The basic idea of my approach is to generate the map information of the working site based on the vehicle position, the rolling resistance, and the road grade. Concretely, a special recursive least squares algorithm with forgetting is used to estimate the road grade and the rolling resistance in realtime [84, 41]. This information is saved together with the localization information. Consequently, after a machine passes by, it has recorded the information about that place. Since mobile machines drive repeatedly along the same routes for a specific task, the method can be expected to work well even when the map information does not cover most of the working site. Fig. 4.1 illustrates the motivation of my approach.

Figure 4.1: Mobile machines perform tasks more efficiently or more safely based on information about their location and surroundings. The short-term goal of SLAM is to prevent construction machinery from always working in low-efficiency regimes for safety reasons, whereas the long-term goal is to increase the productivity of the working site with the help of path planning. This chapter focuses on affordable SLAM technology for construction machines.
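To make the estimator mentioned in the introduction more concrete, the following Python sketch implements a textbook recursive least squares update with a single forgetting factor for a linear model y = φᵀθ. It is a simplification: the method referred to in [84, 41] uses a special variant with forgetting, whereas this sketch uses one common factor, and the class name, default values, and the interpretation of θ (for example, rolling resistance and road grade terms) are assumptions.

```python
class RecursiveLeastSquares:
    """Textbook RLS with one forgetting factor for the model y = phi^T theta."""

    def __init__(self, n_params, forgetting=0.98, initial_covariance=1000.0):
        self.lam = forgetting
        self.theta = [0.0] * n_params
        self.P = [[initial_covariance if i == j else 0.0
                   for j in range(n_params)] for i in range(n_params)]

    def update(self, phi, y):
        """Process one sample with regressor phi and measurement y."""
        n = len(self.theta)
        p_phi = [sum(self.P[i][j] * phi[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(phi[i] * p_phi[i] for i in range(n))
        gain = [value / denom for value in p_phi]
        error = y - sum(phi[i] * self.theta[i] for i in range(n))
        self.theta = [self.theta[i] + gain[i] * error for i in range(n)]
        # P is symmetric, so phi^T P equals (P phi)^T.
        self.P = [[(self.P[i][j] - gain[i] * p_phi[j]) / self.lam
                   for j in range(n)] for i in range(n)]
        return self.theta
```

A forgetting factor below 1 discounts old samples, so the estimate can follow terrain properties that change as the machine moves; the closer the factor is to 1, the smoother but slower the estimate.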

## **4.2 Problem Statement**

Although map information can also be obtained from satellites, it is impossible to obtain valuable information, such as an HD map, from remote sensing alone due to the fast-changing environment of a construction site. In addition, the sensors can be quite noisy, and the noise is further exacerbated on a working site: sensors such as the Global Positioning System (GPS), the Inertial Measurement Unit (IMU), and odometry sensors have higher measurement errors in harsh environments. For instance, since mobile machines may work outside the coverage of base stations, the GPS signal can only achieve an accuracy of nearly 10 m [85] without signal correction. Moreover, different construction machines have different drivetrain systems, which makes a predefined motion model difficult. Last but not least, for passenger cars, the longitudinal error might not have such a negative effect as the lateral error, since further measures can be adopted to avoid collisions. In the case of construction machines, both errors must be treated equally.

## **4.3 Goal of This Chapter**

The goals of this chapter are twofold. The first goal is to find the most suitable sensor arrangement for construction machines. To estimate the position of the machines accurately, rather than trusting the measurement of only one sensor, I fuse different kinds of sensors with the help of sensor fusion technology derived from the Kalman filter [86] to achieve better accuracy. The second goal is to create a map of the environmental conditions by combining the surface resistance, road grade, and position information in realtime. Thanks to this map, further optimization of the operation strategy and path planning can be realized. Although I suggest measuring the surface resistance and road grade by recursive least squares with multiple forgetting factors, the map-building approach I propose can also be combined with other methods and other kinds of sensors, such as the ultrasonic sensors proposed by Jung [87].
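The principle behind Kalman-filter-based sensor fusion can be shown with a scalar example. The sketch below is a plain linear Kalman filter for a one-dimensional position, not the unscented Kalman filter used in this chapter; the function name and the interpretation of the inputs (a displacement from a motion model such as odometry and a position measurement such as GPS) are assumptions for illustration.

```python
def kalman_step(x, p, u, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous position estimate and its variance
    u:    displacement predicted by the motion model (e.g. from odometry)
    z:    position measurement (e.g. from GPS)
    q, r: process and measurement noise variances
    """
    # Predict: move the estimate and inflate its uncertainty.
    x_pred = x + u
    p_pred = p + q
    # Update: blend prediction and measurement according to their variances.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new
```

With equal variances for prediction and measurement, the gain is 0.5 and the fused estimate lies halfway between the two, while its variance is halved; this reduction in variance is the reason for fusing several sensors instead of trusting a single one.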

## **4.4 Related Works**

#### **4.4.1 Sensors**

Combining several sensing systems so that they compensate for each other's technical shortcomings is well known in the field of autonomous systems [88, 89, 90]. Accordingly, there is a large body of research focusing on sensor fusion. Here I first summarize the commonly used sensors for simultaneous localization and mapping (SLAM). Although I agree that an HD map can increase the accuracy of the localization, I do not consider this technology for construction machines because the construction site changes dynamically, as mentioned in [91].

#### **4.4.1.1 GPS**

Global Navigation Satellite Systems (GNSS) such as GPS, GLONASS, BeiDou, and Galileo rely on at least four satellites to estimate the global position at a relatively low cost. The typical average accuracy of standalone GPS ranges from a few meters to above 20 m [85] due to ionospheric delay, multipath effects, ephemeris and clock errors, and Geometric Dilution of Precision (GDOP). To improve the accuracy, one of the most used techniques is Differential GPS (DGPS), which utilizes measurements from an onboard vehicle GPS unit and a GPS unit on a fixed infrastructure unit with a known location. This fixed infrastructure unit is called the reference station; it periodically calculates the local error in the GPS position measurement. The onboard vehicle GPS units then use this correction to adjust their own GPS estimates. According to [92, 93, 94, 95, 96], an average accuracy in the range of 1–2 m can be achieved, mainly depending on the distance between the vehicle and the base station. Another commonly used improvement is Realtime Kinematic (RTK) GPS, which estimates the relative position by means of the phase of the carrier signal and can be expected to achieve centimeter-level accuracy. Notice that both techniques depend on a nearby fixed base station with a known position, though their principles are quite different. In October 2020, when I wrote this chapter, RTK GPS was still an extraordinarily expensive approach<sup>2</sup> and was usually used to define the ground truth position of vehicles. Some low-cost RTK GPS sensors, priced under 1,000 US dollars, are designed with a much lower receive frequency [97] and thus cause problems when vehicles drive fast. Thus, in most commercial uses, DGPS is preferable for reducing the cost. Therefore, in my research, I conservatively assumed a DGPS accuracy of 2 m, which is consistent with the normal performance of DGPS. As the performance of the GPS, especially its accuracy, increases, the performance of my mapping approach will improve accordingly.
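The correction principle of DGPS described above can be illustrated in a few lines. The sketch below works in the position domain with local east/north coordinates in meters, which is a deliberate simplification; practical DGPS services commonly broadcast corrections for the individual satellite range measurements instead. The function name and all coordinates are hypothetical.

```python
def dgps_correct(rover_fix, station_fix, station_truth):
    """Subtract the reference station's current position error from the rover's fix.

    All arguments are (east, north) tuples in meters in a shared local frame.
    Assumption: rover and station see the same error because they are close.
    """
    error_east = station_fix[0] - station_truth[0]
    error_north = station_fix[1] - station_truth[1]
    return (rover_fix[0] - error_east, rover_fix[1] - error_north)


# Hypothetical numbers: the station, surveyed at (100, 200), currently
# measures (103, 198.5), so the shared error is (+3, -1.5).
corrected = dgps_correct((253.0, 48.5), (103.0, 198.5), (100.0, 200.0))
```

The assumption of a shared error weakens as the vehicle moves away from the reference station, which is consistent with the accuracy depending on the distance to the base station.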

#### **4.4.1.2 IMU**

Inertial Measurement Units (IMUs) are integrated electronic devices that contain accelerometers, magnetometers, and gyroscopes. They provide raw measurements from which attitude, angular rates, linear velocity, and position relative to a global reference frame can be calculated.

#### **4.4.1.3 Odometry**

Odometry is the most widely used navigation method for positioning; it provides good short-term accuracy, is inexpensive, and allows very high sampling rates. Odometry is based on simple equations, which hold true when wheel revolutions can be translated accurately into linear displacement relative to the floor. The main advantage of odometry is that all localization information comes from the vehicle itself, so that this information is always available. Usually, it is the only localization information when other sensors are not able to provide data. Thus, a good odometry-based localization system is always necessary, and it is usually the first step towards localization [98].
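As an example of the simple equations mentioned above, the following Python sketch performs one dead-reckoning step for a vehicle with two independently measured wheel sides. It assumes differential-drive kinematics without wheel slip, which is only one of many possible drivetrain layouts of mobile machines; the function names and parameters are illustrative.

```python
import math


def wheel_displacement(revolutions, wheel_radius):
    """Linear displacement of a wheel that rolls without slipping."""
    return 2.0 * math.pi * wheel_radius * revolutions


def odometry_update(x, y, heading, d_left, d_right, track_width):
    """Advance the pose by one step from the left and right wheel displacements.

    heading is in radians; d_left, d_right, and track_width are in meters.
    """
    d_center = (d_left + d_right) / 2.0
    d_heading = (d_right - d_left) / track_width
    # Midpoint approximation: move along the average heading of the step.
    x += d_center * math.cos(heading + d_heading / 2.0)
    y += d_center * math.sin(heading + d_heading / 2.0)
    return x, y, heading + d_heading
```

Because every step adds its own small error to the pose, the accuracy is good in the short term but drifts over time, which is why odometry is usually fused with absolute measurements such as GPS.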

<sup>2</sup> For example, the Trimble R10 costs 18,000 US Dollars on Alibaba.

#### **4.4.2 Localization Technologies**

#### **4.4.2.1 Mobile Robotics**

Sensor fusion for highly accurate localization has been studied worldwide. SLAM technologies can be roughly divided into two groups: indoor and outdoor. For indoor localization, such as for domestic robots [99], a GPS system cannot be used. However, the floor is relatively flat, and thus only a two-dimensional map is needed. In contrast, when it comes to offroad navigation in rough terrain, the algorithms must be capable of handling three dimensions of the environment. After the success of the Kalman filter [86], the extended Kalman filter [100], and finally the unscented Kalman filter [101, 102], the idea that a mobile robot which executes useful missions should be endowed with navigation ability has become a consensus. However, the selection of sensors to combine differs from case to case. For instance, Bento fused the data from ABS sensors and GPS for outdoor localization based on the extended Kalman filter [103], and Zhang integrated the information from GPS and IMU [104]. Li used a camera instead of GPS and achieved mean positioning errors of 75 cm [105]. In addition, Wolcott proposed a visual localization method within LIDAR maps for automated urban driving [106]. To reduce cost, Ward studied the possibility of using radar to localize the vehicle and demonstrated that, in the worst case, the errors of their approach reach 27.8 cm laterally and 115.1 cm longitudinally [107]. To investigate the use of LiDAR for localization, Hata suggested using LiDAR to detect curbs and road markings to create a feature map of the environment and to localize vehicles within the map with the help of RTK-GPS and IMU [108]. Another alternative is the use of ultrasonic sensors, proposed by Jung [87]. Interestingly, with the development of the IoT, more and more researchers are focusing on SLAM by cooperative localization techniques. The basic idea of this approach is to obtain crucial information from infrastructure or other vehicles even when the perception capability is affected by adverse weather or obstacles. For example, del Peral-Rosado showed the feasibility of 5G-based localization technology [109], and Rohani utilized VANET to enhance the GPS accuracy to a mean level of 3.3 m [110].

#### **4.4.2.2 Construction Machines**

For mobile construction machines, the requirements for localization techniques differ between use cases. Some machines may work underground, where the situation is similar to working in a tunnel, whereas others might work on an open-pit mining site. For underground mines, it was proposed to use a laser to extract the wall positions and to dynamically generate a path from these laser data while considering variable offsets [111, 112]. In contrast, for the open-pit site, an autonomous wheel loader introduced by Gu [113] uses a set of sensors, including GPS and IMU for localization, and LIDAR, radar, and a camera for obstacle capture and identification, ensuring that it perceives its surroundings accurately. Moreover, Xiang created a dataset for the detection of mobile machines from the view of a camera fixed on the ground [64], while Bang proposed a method for recognizing the machines from the view of a drone [114]. In addition, visual SLAM has been proposed [115]. Besides that, V2X technology was introduced in the field of construction machines [67, 68]. In 2020, Xiang proposed to use WiFi for the communication between different vehicles by introducing a realtime estimation method with respect to packet loss and delay [116]. Afterward, the feasibility of using 5G for machines was also investigated in [62]. To avoid additional costs for the vehicles, smartphones show great potential as a solution to compensate for the shortcomings of the onboard ECU [117, 40].

In summary, similar to general autonomous vehicles, most automated mobile machines fuse information from onboard sensors such as IMU and GPS by using diverse sensor fusion technologies. Furthermore, camera, LIDAR, and radar are used to perceive the environment on the construction site, to avoid obstacles, and to instruct the machines where to go. However, owing to the harsh environment and the diversity of working sites, LIDAR and radar can be sensitive. In the near future, wireless communication can also contribute to better localization of vehicles.

## **4.5 Model Building**

The wheel loader used in the simulation was modeled in SolidWorks and then imported into the Robot Operating System (ROS) to explore the approaches that should be used for mobile machines. Since the first goal of this chapter is to find suitable sensor arrangements to accurately localize wheel loaders, I fused different arrangements of sensor data. Concretely, I used up to three IMUs and three GPSs in the simulation. Based on the characteristics of GPS, the three GPSs were fixed on the cab of the wheel loader. I then installed two IMU sensors under the front axle and the other IMUs under the rear axle, based on the suggestion from Li [118]. Fig. 4.2 illustrates the wheel loader model I used in the Gazebo environment.

Figure 4.2: Wheel loader model in Gazebo: once the models had been developed in SolidWorks, they were converted to the Unified Robot Description Format (URDF) using a third-party URDF conversion tool called "sw\_urdf\_exporter", which allows SolidWorks parts and assemblies to be conveniently exported into a URDF file. Gazebo provides simulated sensors such as IMUs, GPSs, encoders, cameras, and stereo cameras through gazebo\_plugins, which publish the sensor outputs as ROS messages and service calls, i.e., the gazebo\_plugins create a complete interface (Topic) between ROS and Gazebo.

URDF is an XML file format used in ROS to describe all elements of a vehicle. It can specify the kinematic and dynamic properties of a single robot in isolation. To make my vehicle work properly in Gazebo, additional simulation-specific tags concerning the vehicle pose, friction, inertial elements, and other properties have been added. The transform tree is shown in Fig. 4.3.

Each Link in URDF represents a rigid body. Also, according to the kinematic and dynamic model shown in Fig. 4.3, the wheel loader in the simulation is divided into several parts, including,


Figure 4.3: The dynamic system simulated by URDF file on ROS.

In this project, the wheel loader receives GPS data, i.e., its latitude and longitude, from an onboard GPS sensor plugin. However, the GPS data provided by the GPS plugin cannot be directly applied to the fusion of the sensor data, so a coordinate system conversion for the GPS data is required. For the simulation, I set a transform for each GPS that converts the vehicle's world frame coordinates, i.e., the frame with its origin at the vehicle's initial position, to the GPS's UTM coordinates, the same as [119], as

$$\mathbf{T} = \begin{bmatrix} c\theta c\psi & c\psi s\phi s\theta - c\phi s\psi & c\phi c\psi s\theta + s\phi s\psi & x\_{UTM\_0} \\ c\theta s\psi & c\phi c\psi + s\phi s\theta s\psi & -c\psi s\phi + c\phi s\theta s\psi & y\_{UTM\_0} \\ -s\theta & c\theta s\phi & c\phi c\theta & z\_{UTM\_0} \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{4.1}$$

where φ, θ, and ψ denote the vehicle's initial UTM-frame roll, pitch, and yaw, and c and s denote the cosine and sine functions, respectively. $x\_{UTM\_0}$, $y\_{UTM\_0}$, and $z\_{UTM\_0}$ are the UTM coordinates of the first reported GPS position. The GPS signal is then transformed into the vehicle's world coordinate frame, odom, by

$$\begin{bmatrix} x\_{odom} \\ y\_{odom} \\ z\_{odom} \\ 1 \end{bmatrix} = \mathbf{T}^{-1} \begin{bmatrix} x\_{UTM\_t} \\ y\_{UTM\_t} \\ z\_{UTM\_t} \\ 1 \end{bmatrix} \tag{4.2}$$

In my simulation, the ROS package "robot\_localization" from Moore [119] was used. It includes a "navsat\_transform" node, which provides functions to convert between various coordinate frames and to integrate GPS data. In particular, it provides a transformation that allows the conversion between the GPS frame, expressed in latitude and longitude, and the vehicle's coordinate frame. This process is carried out for each GPS independently.
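The conversion in Eq. 4.1 and Eq. 4.2 can be sketched in a few lines of NumPy. This is only an illustration, assuming the standard roll-pitch-yaw (Z-Y-X) rotation matrix and a rigid transform. The function names and the UTM coordinates are hypothetical, and the code is not the implementation used inside "robot\_localization".

```python
import numpy as np

def utm_to_odom_transform(roll, pitch, yaw, x0, y0, z0):
    """Build the 4x4 homogeneous transform T of Eq. 4.1 (angles in radians)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    return np.array([
        [cp * cy, cy * sr * sp - cr * sy, cr * cy * sp + sr * sy, x0],
        [cp * sy, cr * cy + sr * sp * sy, -cy * sr + cr * sp * sy, y0],
        [-sp, cp * sr, cr * cp, z0],
        [0.0, 0.0, 0.0, 1.0],
    ])

def gps_utm_to_odom(T, utm_xyz):
    """Apply Eq. 4.2. For a rigid transform, T^-1 p equals R^T (p - t)."""
    R, t = T[:3, :3], T[:3, 3]
    return R.T @ (np.asarray(utm_xyz, dtype=float) - t)

# Hypothetical first GPS fix (UTM easting, northing, altitude) and a 90 deg initial yaw.
T = utm_to_odom_transform(0.0, 0.0, np.pi / 2, 456000.0, 5428000.0, 115.0)
first_fix = gps_utm_to_odom(T, [456000.0, 5428000.0, 115.0])
later_fix = gps_utm_to_odom(T, [456000.0, 5428010.0, 115.0])
T_general = utm_to_odom_transform(0.1, -0.2, 0.3, 0.0, 0.0, 0.0)
```

In this example the first fix maps to the origin of the odom frame, and a fix 10 m further north maps to 10 m along the odom x-axis, because the vehicle initially faces north.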

In practical applications, GPS signals may arrive only infrequently. Yet the localization must maintain its state estimate even when some of the vehicle's signals are absent. Therefore, the performance of the filters when GPS signals arrive infrequently shall be evaluated. Taking this problem into consideration, I used a ROS node built by Li [118] to filter the collected GPS signals such that GPS data is unavailable for 1 second once every 10 seconds. In case multiple GPS sensors are used, the signal failures might not happen at the same time. I aim to observe how the filters and my approach behave with different sensor configurations when some GPS signals fail.
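The dropout pattern described above can be sketched as a simple time-based gate. This is a hypothetical stand-in for illustration, not the node from [118]: the function names, the position of the outage within each period, and the `offset` parameter (used to desynchronize several receivers) are assumptions.

```python
def gps_available(t, period=10.0, outage=1.0, offset=0.0):
    """Return True if a GPS fix stamped at time t (seconds) is passed on.

    The fix is dropped during the first `outage` seconds of every `period`.
    `offset` shifts the outage window, e.g. per receiver.
    """
    phase = (t - offset) % period
    return phase >= outage

def filter_fixes(stamped_fixes, **kwargs):
    """Keep only the (time, fix) pairs that fall outside the outage window."""
    return [(t, fix) for t, fix in stamped_fixes if gps_available(t, **kwargs)]

# Hypothetical 2 Hz GPS stream over 20 s.
fixes = [(0.5 * k, (float(k), 0.0)) for k in range(40)]
kept_gps1 = filter_fixes(fixes)              # outages at 0-1 s and 10-11 s
kept_gps2 = filter_fixes(fixes, offset=5.0)  # outages at 5-6 s and 15-16 s
```

With different offsets, at least one receiver delivers a fix at any time, which mirrors the case in which the signal failures of several GPS sensors do not coincide.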

### **4.5.1 Sensor Fusion for Localization**

For the localization of the wheel loader in the simulation environment, I used the EKF and UKF nodes in "robot\_localization" [119]. On the one hand, this package places no limit on the number of sensor inputs, which is in line with the requirements of my construction machines. On the other hand, a concrete motion model is not needed, so this method can easily be used on both excavators and wheel loaders with different drivetrain solutions. In "robot\_localization", the filter's state is driven forward by a standard 3D kinematic model derived from Newtonian mechanics to calculate the vehicle's motion, including position, velocity, and acceleration in three dimensions.

In the correction step, the measurement model integrates sensor data to update the predicted state. GPS provides position information, and wheel encoders provide velocity information for the correction. Moreover, the orientation, velocity, and acceleration information is updated from the IMUs via gyroscopes and accelerometers. Furthermore, the process model and the measurement model each require a noise covariance matrix, Q and R, respectively. These matrices describe the uncertainty in the system. The process noise covariance Q is added to the process model and contributes to the overall uncertainty of the algorithm. Intuitively, a large value in the Q matrix means considerable uncertainty in the process model, causing the system to place greater confidence in the measurement data. In the current implementation, the process noise matrix was set as a diagonal matrix. The entries for state variables that are directly measured by the sensors, such as the x-y position by GPS and the orientation by the IMUs, were set relatively small. The variables that were not directly measured could be updated from the measured data. The measurement covariance matrix R corresponds to the confidence in the sensor data. Similarly, the greater the noise in the elements of this matrix, the lower the confidence in the measurement data. The measurement covariances are derived from the noise characteristics of the sensors.
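The influence of Q and R on the correction can be illustrated with a scalar Kalman gain. The following is a minimal sketch, assuming a one-dimensional state with identity state transition and a direct measurement. The function name and all numeric values are hypothetical and are not taken from the configuration used in this work.

```python
def scalar_kalman_gain(p_prev, q, r):
    """Gain of a 1D Kalman filter with identity transition and H = 1."""
    p_pred = p_prev + q          # predicted variance grows by the process noise
    return p_pred / (p_pred + r)  # share of the innovation that is applied

gain_small_q = scalar_kalman_gain(p_prev=1.0, q=0.01, r=1.0)
gain_large_q = scalar_kalman_gain(p_prev=1.0, q=10.0, r=1.0)
gain_large_r = scalar_kalman_gain(p_prev=1.0, q=0.01, r=100.0)
```

A larger Q pushes the gain toward 1, so the filter follows the measurement more closely, whereas a larger R pushes the gain toward 0, so the filter relies more on its prediction.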

In addition to the inaccuracy of the filters, outliers are also an important source of error. In my simulation, I assume that the measurements have Gaussian distributions. Although the sensors are configured to follow a normal distribution, improbable and extremely noisy measurements can still appear due to the high fidelity of the ROS simulation. To counteract this problem, I used the Mahalanobis distance to detect outliers and thus overcome the consequent adverse effect. After this, the filtered data were used for state correction. Concretely, the Mahalanobis distance is calculated from the difference between the measurement and the vector predicted by the filter to find the outliers,

$$D\_M(\vec{x}) = \sqrt{(Z\_t - \hat{Z}\_t)^T A^{-1} (Z\_t - \hat{Z}\_t)} \tag{4.3}$$

where $Z\_t$ is the measurement, $\hat{Z}\_t$ is the measurement predicted by the filter, and A is the covariance matrix.
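Eq. 4.3 can be turned into a simple outlier gate. This is a minimal sketch: the function names, the covariance values, and the rejection threshold of 3 are assumptions for illustration, not the settings used in my simulation.

```python
import numpy as np

def mahalanobis_distance(z, z_pred, cov):
    """Mahalanobis distance of Eq. 4.3 between a measurement and its prediction."""
    innovation = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    # Solve A x = innovation instead of forming the explicit inverse of A.
    return float(np.sqrt(innovation @ np.linalg.solve(cov, innovation)))

def is_outlier(z, z_pred, cov, threshold=3.0):
    """Flag a measurement whose distance exceeds the (assumed) threshold."""
    return mahalanobis_distance(z, z_pred, cov) > threshold

# Hypothetical 2D position covariance with a standard deviation of 2 m per axis.
cov = np.diag([4.0, 4.0])
d_ok = mahalanobis_distance([1.0, 2.0], [0.0, 0.0], cov)
d_bad = mahalanobis_distance([30.0, 0.0], [0.0, 0.0], cov)
```

Measurements flagged by `is_outlier` would be skipped in the correction step, so a single improbable reading does not pull the state estimate away.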

#### **4.5.2 Sensor Fusion Methods**

State estimation is one of the most critical issues in many autonomous applications. With an accurate state estimate, a machine can be navigated effectively in the environment and can thereby make optimal decisions for specific purposes. For instance, to reach a target destination, the machine needs to know its current state, which consists of position, velocity, acceleration, and heading, to execute the right maneuvers. Since sensors are susceptible to noise and imperfections that introduce uncertainty into the measurements, the filter's goal is to fuse all the available sensor data as well as the vehicle's own dynamics to obtain a more precise estimate of the vehicle's state. As mentioned, two important extensions of the Kalman filter are presented, namely the EKF and the UKF.

The filters are designed to improve the positioning accuracy by compensating for the disadvantages of the different sensors. GPS provides relatively accurate positioning, but the signal's availability remains a problem, especially in urban and mountainous environments. As a result, using GPS alone is usually not satisfactory. IMUs use a combination of accelerometers and gyroscopes to measure linear accelerations and angular velocities, respectively. From this information, the trajectory of the wheel loader can be calculated as the vehicle travels by estimating the position relative to its initial position. However, IMUs share a common problem, namely accumulated errors. To avoid accumulated drift and provide global positioning, the estimated position shall be corrected by using other sensors, e.g., GPS. A combined GPS/IMU system therefore increases the accuracy beyond the capabilities of standalone GPS or IMU.

#### **4.5.2.1 Extended Kalman Filter**

Although linear Gaussian systems are abundant, most systems in reality are nonlinear. Also, they often do have Gaussian noise. Wrong assumptions about the system can lead the Kalman filter to diverge and to provide estimates with very high errors. Consequently, multiple extensions have been developed to deal with the various scenarios encountered in practice. One of the best-known variations is the EKF, which deals with nonlinearity by approximating a linear equivalent before performing the required filtering sequence. The idea of the EKF is that if the system is close to linear over short periods, using its linear approximation will not yield large errors.

Through linearizing the basic equations from Welch [102], the following equations are obtained:

$$x\_{k+1} \approx \tilde{x}\_{k+1} + B(x\_k - \hat{x}\_k) + Ww\_k \tag{4.4}$$

$$z\_k \approx \tilde{z}\_k + H(x\_k - \hat{x}\_k) + Vv\_k \tag{4.5}$$

where $x\_{k+1}$ and $z\_k$ are the actual state and measurement vectors, and $\tilde{x}\_{k+1}$ and $\tilde{z}\_k$ are the approximate state and measurement vectors. $\hat{x}\_k$ is an a posteriori estimate of the state at step k, and the random variables $w\_k$ and $v\_k$ represent the process and measurement noise. B is the Jacobian matrix of partial derivatives of f(•) with respect to x, W is the Jacobian matrix of partial derivatives of f(•) with respect to w, H is the Jacobian matrix of partial derivatives of h(•) with respect to x, and V is the Jacobian matrix of partial derivatives of h(•) with respect to v.

An essential feature of the EKF is that the Jacobian $H\_k$ in the equation for the Kalman gain $K\_k$ serves to correctly propagate only the relevant component of the measurement information [102]. However, a linearization error always exists because the function is nonlinear. Because increasing the sampling rate and reducing the nonlinearity of the function are not always viable, the error-state EKF has been proposed to counteract this adverse effect. The basic idea of the error-state EKF is to reduce the distance of the linear approximation from the operating point: instead of linearizing the nominal state, it handles the error state.
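One prediction and correction cycle of an EKF can be sketched as follows. This is a minimal illustration under stated assumptions: a planar unicycle motion model with state (x, y, heading), a GPS-like position measurement, and additive noise (W and V equal to the identity). The model, the function names, and all numeric values are hypothetical, and this is not the motion model of "robot\_localization".

```python
import numpy as np

def f(x, u, dt):
    """Assumed unicycle motion model: state (px, py, heading), input (v, omega)."""
    px, py, th = x
    v, om = u
    return np.array([px + v * np.cos(th) * dt, py + v * np.sin(th) * dt, th + om * dt])

def jacobian_f(x, u, dt):
    """Jacobian B of f with respect to the state, evaluated at x."""
    th, v = x[2], u[0]
    return np.array([[1.0, 0.0, -v * np.sin(th) * dt],
                     [0.0, 1.0, v * np.cos(th) * dt],
                     [0.0, 0.0, 1.0]])

# Position-only measurement, so h is linear and its Jacobian H is constant.
H = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

def ekf_step(x, P, u, z, Q, R, dt):
    """One EKF cycle: linearized prediction followed by the correction."""
    B = jacobian_f(x, u, dt)
    x_pred = f(x, u, dt)
    P_pred = B @ P @ B.T + Q
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_new, P_new

x_new, P_new = ekf_step(x=np.zeros(3), P=np.eye(3), u=(1.0, 0.0),
                        z=np.array([1.2, 0.1]), Q=0.01 * np.eye(3),
                        R=np.eye(2), dt=1.0)
```

The corrected position lies between the predicted position (1.0, 0.0) and the measurement (1.2, 0.1), weighted by the Kalman gain.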

#### **4.5.2.2 Unscented Kalman Filter**

When the state transition and observation models, that is, the predict and update functions f and h, are highly nonlinear, the EKF can give particularly poor performance. This is because the covariance is propagated through the linearization of the underlying nonlinear model. By contrast, the UKF uses the unscented transform instead of linearization in the prediction and correction steps to make the estimation. As the first step, the decomposition of the covariance matrix is computed and the sample points are carefully selected. Here I selected 2L+1 points, described as,

$$\begin{aligned} \mathcal{X}^0 &= \bar{x} \\ \mathcal{X}^i &= \bar{x} + (\sqrt{(L+\lambda)P\_x})\_i \\ \mathcal{X}^{i+L} &= \bar{x} - (\sqrt{(L+\lambda)P\_x})\_i \end{aligned} \qquad \begin{aligned} i &= 1, \dots, L \\ i &= 1, \dots, L \end{aligned} \tag{4.6}$$

where $\bar{x}$ is the mean, which is selected as the first sample point, L is the dimension of the state, and λ is set as λ = 3 − L for a Gaussian probability density function. After that, the sigma points are propagated through the nonlinear function as,

$$\check{X}^i = f(\mathcal{X}^i) \qquad i = 0, \dots, 2L \tag{4.7}$$

where f(•) is the motion model function and $\check{X}^i$ is the predicted position based on the motion model. As the final step of the prediction process, the predicted mean and covariance are calculated.

$$a^i = \begin{cases} \dfrac{\lambda}{L + \lambda}, & i = 0 \\ \dfrac{1}{2(L + \lambda)}, & \text{otherwise} \end{cases} \tag{4.8}$$

$$\check{X} = \sum\_{i=0}^{2L} a^i \check{X}^i \tag{4.9}$$

$$\check{P} = \sum\_{i=0}^{2L} a^i (\check{X}^i - \check{X})(\check{X}^i - \check{X})^T + Q \tag{4.10}$$

Here Q denotes the process noise covariance. In the correction step, I first calculate the predicted measurements from the sigma points,

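$$\mathcal{Y}^i = h(\check{X}^i, 0), \qquad i = 0, \dots, 2L \tag{4.11}$$

where h(•) is the measurement model, evaluated with zero measurement noise. To take the process noise into account, the sigma points are drawn from the predicted covariance via the decomposition $\check{L}\check{L}^T = \check{P}$. Then, I get the mean and covariance of the predicted measurements,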

$$\mathcal{Y} = \sum\_{i=0}^{2L} a^i \mathcal{Y}^i \tag{4.12}$$

$$P\_y = \sum\_{i=0}^{2L} a^i (\mathcal{Y}^i - \mathcal{Y}) (\mathcal{Y}^i - \mathcal{Y})^T + R \tag{4.13}$$

where R is the measurement noise covariance. Based on the previous results, I can compute the cross-covariance and the Kalman gain, and then obtain the corrected covariance and mean.

$$P\_{xy} = \sum\_{i=0}^{2L} a^i (\check{X}^i - \check{X})(\mathcal{Y}^i - \mathcal{Y})^T \tag{4.14}$$

$$K = P\_{xy} P\_y^{-1} \tag{4.15}$$

$$
\hat{P} = \check{P} - KP\_yK^T \tag{4.16}
$$

$$\hat{X} = \check{X} + K(y^0 - \mathcal{Y}) \tag{4.17}$$

where $y^0$ is the actual measurement, $\mathcal{Y}$ is the predicted measurement mean from Eq. 4.12, and $\hat{X}$ is the final result of the UKF. As can be seen, the UKF does not need a Jacobian matrix and propagates the mean and covariance through the nonlinear functions themselves, so it can achieve better estimation performance. However, its computational effort can be slightly higher than that of the other methods, which makes a careful selection of the sensor configuration meaningful.
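The sigma-point selection, the weights, and the weighted mean and covariance described above can be sketched as one unscented transform. This is a minimal illustration, assuming the standard weights (which sum to one), a Cholesky factor as the matrix square root, and additive noise. The function names and the test values are hypothetical.

```python
import numpy as np

def sigma_points(mean, cov, lam):
    """Select 2L+1 sigma points around the mean (cf. Eq. 4.6)."""
    n_dim = mean.size
    S = np.linalg.cholesky((n_dim + lam) * cov)  # S @ S.T = (L + lambda) * P_x
    points = [mean]
    points += [mean + S[:, i] for i in range(n_dim)]
    points += [mean - S[:, i] for i in range(n_dim)]
    return np.array(points)

def sigma_weights(n_dim, lam):
    """Standard sigma-point weights (cf. Eq. 4.8); they sum to one."""
    w = np.full(2 * n_dim + 1, 1.0 / (2.0 * (n_dim + lam)))
    w[0] = lam / (n_dim + lam)
    return w

def unscented_transform(mean, cov, func, noise_cov, lam=None):
    """Propagate mean and covariance through func (cf. Eq. 4.7, 4.9, 4.10)."""
    n_dim = mean.size
    if lam is None:
        lam = 3.0 - n_dim
    w = sigma_weights(n_dim, lam)
    propagated = np.array([func(p) for p in sigma_points(mean, cov, lam)])
    new_mean = w @ propagated
    diff = propagated - new_mean
    new_cov = (w[:, None] * diff).T @ diff + noise_cov
    return new_mean, new_cov

# For a linear function the transform reproduces the exact Kalman prediction.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
mean0 = np.array([1.0, 2.0])
cov0 = np.array([[2.0, 0.5], [0.5, 1.0]])
Q_demo = 0.1 * np.eye(2)
ut_mean, ut_cov = unscented_transform(mean0, cov0, lambda p: A @ p, Q_demo)
```

The same routine can be reused in the correction step by passing the measurement model as `func` and R as `noise_cov`.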

#### **4.5.3 Realtime Map Plotter**

Inspired by the research from Fankhauser [82], a map can include many layers to store different types of information. Thus, to develop a realtime plotter of the construction site according to the ground condition, I use a multi-layer grid-based map, which divides the environment into uniform cells. Fig. 4.4 illustrates the multilayered grid map concept, where the data of each cell is stored on congruent layers. In this project, since the resistance and grade of the road are the most important information for construction machines, I adopt a two-layer grid map. However, I show the method to save three different types of information in this chapter. Concretely, the map is divided into small cells with a resolution of 1 m per cell. Notice that I use a much smaller cell than the one used in the previous chapter. On the one hand, this is because the measurement done by foot SLAM cannot cover such large areas in one step. On the other hand, a finer grid map describes the terrain information better. The map used in the previous chapter can be derived from the finer map provided in this chapter. As discussed in the previous section, I use GPS/IMU fusion Kalman filter algorithms to locate the mobile machine on the construction site.

As can be seen in Fig. 3.4, to describe the ground condition of construction sites, I use the value of each cell to represent the information of the ground situation. An exemplary layer that holds the data of a grid-based map has been shown in Fig. 3.4 in the previous chapter. Although I only demonstrate the approach with two layers, it is relatively easy to add a third layer in case distinguishing uphill from downhill is vital, as the third layer is responsible for recording the heading of the vehicle.

Figure 4.4: A grid map created by SLAM. My approach uses multilayered grid maps to store data for different types of information. Concretely, every grid cell saves a 1\*3 matrix including the location information and the resistance or grade, depending on which layer it is. A grid cell with site information is created after the vehicle passes by. The map is saved as a 2\*m\*n\*3 tensor, where m is the maximum displacement in the x-direction and n is the maximum displacement in the y-direction. In case a grid cell has never been occupied by the vehicle, it is marked as NaN to denote an unknown region.

Based on the previous study [84], both the resistance and the grade of the road can be gathered in realtime. Thus, I assume that the ground resistance and slope are known after the mobile machine has passed by. When the mobile machine passes through a grid cell, the ground information is added to the corresponding cell of the two-layer grid map. Concretely, the plotting algorithm combines the localization results from the Kalman filter with the grade and resistance information from the recursive least squares algorithm. To implement the plotting algorithm, a ROS node was created that subscribes to the localization results from the Kalman filter node and gathers the current information from the ground truth maps, under the assumption that the resistance and grade of the road can be estimated or measured well [118].
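The plotting idea can be sketched with a small NumPy-based map. This is a simplified, hypothetical illustration rather than the plot\_node from [118]: it stores one value per layer and cell instead of the 1\*3 matrix described in Fig. 4.4, and the class name, the map size, and the example values are assumptions.

```python
import numpy as np

class LayeredGridMap:
    """Multi-layer grid map in which unvisited cells stay NaN."""

    def __init__(self, size_x_m, size_y_m, resolution_m=1.0,
                 layers=("resistance", "grade")):
        self.resolution = resolution_m
        self.layers = {name: k for k, name in enumerate(layers)}
        m = int(np.ceil(size_x_m / resolution_m))
        n = int(np.ceil(size_y_m / resolution_m))
        self.data = np.full((len(layers), m, n), np.nan)

    def cell_index(self, x, y):
        """Map a position in meters to a cell index, or None if off the map."""
        i = int(np.floor(x / self.resolution))
        j = int(np.floor(y / self.resolution))
        _, m, n = self.data.shape
        if 0 <= i < m and 0 <= j < n:
            return i, j
        return None

    def update(self, x, y, **values):
        """Write the given layer values into the cell at the estimated position."""
        idx = self.cell_index(x, y)
        if idx is None:
            return False
        for name, value in values.items():
            self.data[self.layers[name], idx[0], idx[1]] = value
        return True

# Hypothetical 50 m x 30 m site; the filter reports the machine at (12.4, 3.9).
site_map = LayeredGridMap(50.0, 30.0)
inside = site_map.update(12.4, 3.9, resistance=0.008, grade=0.0)
outside = site_map.update(-1.0, 3.9, resistance=0.008, grade=0.0)
```

In a ROS node, `update` would be called from the callback that receives the filtered position, together with the current resistance and grade estimates.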

## **4.6 Vehicle Simulation Scenarios in ROS and Gazebo**

To test the feasibility of the map plotter, different road conditions on a construction site were first defined. ROS is a mature and flexible framework for robotics programming, providing the required tools to easily access sensor data, process that data, and generate appropriate responses for the robot's actuators. Due to these characteristics, ROS is a suitable tool for many types of research on modern robotics. After all, a mobile machine can be considered just another type of robot, so the same types of programs can be used to develop advanced construction machines. In this chapter, the construction site shown in Fig. 4.5 was simulated. Based on this real construction site, five different ground resistances were defined in the simulation environment according to the different ground conditions. Moreover, two areas were defined according to different slopes. Concretely, the different regions were distinguished based on the road material.


5. Dry concrete surface: Dry concrete is a normal building material and the typical rolling resistance coefficient of the dry concrete road is 0.008.

Besides, two different ground slopes were also defined appropriately. Since only a two-layer grid map is used in this research, I do not discriminate uphill or downhill.


For the simulation, the green area was defined as the dry concrete surface in the ground-truth resistance map. The red area represents the gravel surface, the blue area represents the sand surface, and the black and yellow areas represent dry and wet dirt roads, respectively. With this premise, the ground truth map of this construction site is drawn. In the same way as the ground-truth resistance map, the ground slope map is defined, where the green area represents the flat area and the red area represents the 15° slope.

In this project, the plot\_node [118] was written in Python with the OpenCV library to visualize the plotted map. To test the feasibility of a realtime plotter in simulation, a ground\_truth node [118] is used in ROS, which provides the ground truth position of the simulated mobile machine in Gazebo. When the mobile machine moves to a certain position, the system uses the ground truth position to determine the rolling resistance coefficient and the road grade, and then uses the position estimated by the Kalman filter to plot the corresponding information in the grid-based map.

In ROS and Gazebo, the built-in plugins provide many adjustable parameters that can be used to tune the devices' performance. To get closer to reality, the performance parameters of the GPSs and IMUs were set in the simulation according to real GPS and IMU devices. To find the best sensor configuration, different sensor configurations for different algorithms were implemented. Each group of sensor configurations was simulated under the same conditions by rosbag, and the results were output simultaneously.

(a) Example construction site divided into five areas according to different ground resistances.

(b) Example construction site divided into two areas according to different slopes.

Figure 4.5: The ground truth map with dimensions. The simulation environment I used in Gazebo was modeled based on a real construction site, and the parameters were selected according to the material characteristics. Since simulating a small construction site may cause system error and thus lack plausibility, I enlarged this real construction site's dimensions in Gazebo.

## **4.7 Experiment and Results**

#### **4.7.1 Localization Results**

To explore the most suitable sensor arrangements of the Kalman filter for construction machinery, the results from different sensor configurations with different methods were compared. The concrete sensor arrangements in this project are shown in Tab. 4.1.


Table 4.1: Sensor configuration. In my simulation, I ignore the minor difference caused by the slightly different installation positions of the sensors. Therefore, the number of configurations is reduced from 64 to 16. Since odometry is robust and necessary for many applications, I do not consider the case without an encoder.

To evaluate the positioning accuracy of the different variants of sensors and algorithms, I drove the wheel loader on the previously defined construction site in Gazebo and recorded the sensor data at the same time. Afterward, the localization results of the different methods were compared to the ground truth. Here I use the Root-Mean-Square Error (RMSE) as a quantitative indicator of the error to assess the pose estimation results.

$$RMSE = \sqrt{\frac{\sum\_{i=1}^{n} \left(\tilde{S}\_i - \hat{\tilde{S}}\_i\right)^2}{n}} \tag{4.18}$$

where $\hat{\tilde{S}}\_i$ is the vector of the estimated x and y position, $\tilde{S}\_i$ denotes the ground truth x and y position, n is the number of estimated samples, and the subscript i denotes the i-th time step.
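Eq. 4.18 can be evaluated in a few lines. This is a minimal sketch, assuming that the squared difference of the position vectors is taken as the squared Euclidean distance. The function name and the sample trajectories are hypothetical.

```python
import numpy as np

def position_rmse(estimated_xy, ground_truth_xy):
    """RMSE of Eq. 4.18 over n time steps of (x, y) positions."""
    est = np.asarray(estimated_xy, dtype=float)
    truth = np.asarray(ground_truth_xy, dtype=float)
    squared_error = np.sum((truth - est) ** 2, axis=1)  # per-step squared distance
    return float(np.sqrt(np.mean(squared_error)))

# Hypothetical two-step trajectory with errors of 0 m and 5 m.
rmse_demo = position_rmse([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])
```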

Fig. 4.6 shows the estimated trajectories of the wheel loader given by the different approaches together with the ground truth trajectory, where the red lines are the ground truth trajectories that the vehicle follows, and the blue lines are the estimated positions of the vehicle. Since only EKF and UKF are capable of handling the nonlinear problem, I compare the results obtained using EKF and UKF filtering with different sensor arrangements. Notice that the odometry sensor is always used, even where I do not explicitly mention it. The UKF performs better than the EKF, which is in line with the conclusion of most studies. Generally speaking, fusing GPS into the estimation improves the accuracy drastically. Also, since a GPS loses its signal once every 10 seconds, an additional GPS sensor increases the position accuracy. In contrast, more IMUs only slightly improve the positioning accuracy. As can be seen, the IMU + Odometry estimation method yields inferior performance with both EKF and UKF. Once the IMU reports an inaccurate heading, the errors accumulate, causing the estimated position to drift further away from the true position as the wheel loader travels.

Figure 4.6: The IMU + Odometry estimation method yields an RMSE of over 70 m for both EKF and UKF. Except for the IMU + Odometry methods, the results can be divided into four performance levels according to the error scale. The worst level is EKF with 1 GPS, where the RMSE is about 3 ∼ 4 m. The second level is EKF with 2 GPSes and EKF with 3 IMUs and 3 GPSes. In these cases, the RMSEs are about 2 ∼ 3 m. The third level is UKF with 1 GPS and 1 IMU, where an RMSE of about 2 ∼ 2.5 m is achieved. Finally, the best class in my experiment is UKF with 2 GPSes and 1 IMU, and UKF with 3 IMUs and 3 GPSes, which reduce the RMSE to about 1.2 m.

Fig. 4.7 shows a comparison of the localization error between the different approaches with respect to time. As shown in Fig. 4.7, the RMSE and the Euclidean distance error of each sensor arrangement can be obtained, indicating that the EKF has a larger tracking error than the UKF. This happens because the linearization through its Jacobian is an approximation, and the kinematic model of the wheel loader is highly nonlinear. To give a more intuitive and accurate description of the error for each sensor configuration and method, Tab. 4.2 lists the RMSE in detail.

Figure 4.7: Quantitative evaluation of different methods. Here (a) and (b) show the accumulative error, i.e., RMSE, while (c) and (d) demonstrate the current Euclidean distance error of each sensor arrangement. There are some noticeable instantaneous position changes every ten seconds due to infrequent GPS signal loss. Generally, UKF shows better performance than EKF for both RMSE and Euclidean distance error.

As mentioned above, the GPS signal is lost for one second every ten seconds. Thus, there are some noticeable instantaneous position changes every ten seconds, which are clearly caused by the loss of GPS signals. As can be seen, for sensor configurations with only one GPS, the jumps in the localization error as well as in the variance values are sharper. As the number of GPS sensors increases, the jumps diminish and therefore become more acceptable.


Table 4.2: Localization errors of each approach

The simulation results show that the UKF is the better approach when using data collected by the onboard sensors of the wheel loader in the Gazebo environment. Intuitively, more sensors mean higher accuracy. However, the results show that an appropriate number of sensors can achieve acceptable accuracy at a lower cost. As can be seen in Tab. 4.2, the RMSE of the UKF with 1 IMU and 2 GPSes is 1.72 m, and that of the UKF with 3 IMUs and 2 GPSes is 1.40 m. With additional 2 IMUs and 1 GPS, the RMSE is only slightly reduced by 0.3 m. More importantly, the maximal error is strongly diminished by one additional GPS, whereas further increasing the number of sensors does not reduce the error proportionally. Of course, different sensor configurations shall be chosen for different application scenarios. For my application scenario, the UKF with 1 IMU and 2 GPSes has sufficient accuracy and is more economical with respect to sensor hardware cost and onboard ECU computational effort. Thus, I suggest using this sensor configuration to locate the mobile machines and then to develop the realtime map plotter.

### **4.7.2 Plotter results**

As my ultimate goal is to create a map of the current working site in realtime based on localization technology so that corresponding optimization can then be achieved, the ground truth maps and estimated maps with different sensor configurations coupled with various algorithms are shown and compared in Fig. 4.8.

Figure 4.8: The ground truth and estimated maps. In the ground resistance map and the road grade map plotted by the EKF with 1 IMU and 1 GPS, the spikes caused by the infrequent GPS signal are quite obvious. With an additional GPS sensor fused in the Kalman filter, the spikes are greatly reduced.

Since I use a two-layer grid map, both the rolling friction coefficient and the road grade are recorded and used to create the estimated maps. In this chapter, the ground information is saved in the corresponding grid cell after the wheel loader passes by and identifies the ground information, such as friction and slope, through a specific algorithm, e.g., recursive least squares. As mentioned above, to locate the mobile machine in a cost-efficient fashion, the configuration of the UKF with 1 IMU and 2 GPSes is adopted. Also, in order to assess the performance of the selected configuration, I draw the results of the EKF with 1 IMU and 1 GPS and of the UKF with 1 IMU and 1 GPS as the control groups. The mispredicted points are calculated as described in Eq. 4.19,

$$E\_{r,s} = \sum\_{i=1, j=1}^{m,n} e\_{i,j}, \qquad e\_{i,j} = \begin{cases} 1, & \text{if } \neg(\hat{G}\_{i,j} = G\_{i,j}) \land \neg(\hat{G}\_{i,j} = NaN) \\ 0, & \text{if } (\hat{G}\_{i,j} = G\_{i,j}) \land \neg(\hat{G}\_{i,j} = NaN) \\ 0, & \text{if } \hat{G}\_{i,j} = NaN \end{cases} \tag{4.19}$$

where $\hat{G}\_{i,j}$ denotes the estimated grid map information, $G\_{i,j}$ is the ground truth map information, $e\_{i,j}$ indicates whether cell (i, j) is mispredicted, E is the accumulated number of errors, and the subscripts r and s denote the resistance and slope map, respectively. The goal is to minimize the share of mispredicted points among all predicted points, described as Eq. 4.20,

$$\min J\_{r,s} = \frac{E\_{r,s}}{T} \tag{4.20}$$

where T is the total estimated number, and J is the quantitative criteria to evaluate the accuracy of the plotted map.
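The criterion of Eq. 4.19 and Eq. 4.20 can be sketched as follows. This is a minimal illustration and not the Matlab implementation from [118]: the function name is an assumption, and except for the dry concrete coefficient of 0.008, the map values are hypothetical.

```python
import numpy as np

def misprediction_ratio(estimated, truth):
    """Share J of plotted cells whose value differs from the ground truth.

    Cells that are NaN in the estimated map were never visited and are ignored.
    """
    estimated = np.asarray(estimated, dtype=float)
    truth = np.asarray(truth, dtype=float)
    plotted = ~np.isnan(estimated)
    wrong = plotted & (estimated != truth)   # e_ij = 1 in Eq. 4.19
    total = int(plotted.sum())               # T in Eq. 4.20
    return float(wrong.sum()) / total if total else float("nan")

# Hypothetical 2 x 2 resistance layer: one unvisited cell and one wrong cell.
estimated_map = [[0.008, np.nan], [0.02, 0.3]]
truth_map = [[0.008, 0.008], [0.05, 0.3]]
ratio_demo = misprediction_ratio(estimated_map, truth_map)
```

Here three cells were plotted and one of them is wrong, so the ratio is one third.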

The localization errors of each group are shown in Tab. 4.3, which demonstrates the necessity of introducing the second GPS sensor and of adopting the unscented Kalman filter.


Table 4.3: Localization errors of each approach during the plotting

To evaluate the accuracy of the plotted maps, Matlab was used to implement an algorithm [118] that compares the predefined areas with the plotted path the mobile machine traveled. Fig. 4.9 graphically illustrates the mispredicted grid points in white, where the mispredicted points are generally distributed in the marginal zones. This is because, even if the localization makes some mistakes, they are unlikely to cause a plotting error as long as the vehicle is not at the very edge between different zones, which indicates the robustness of this mapping idea for the construction site.

(a) Result of EKF with 1 IMU and 1 GPS. (b) Result of UKF with 1 IMU and 1 GPS. (c) Result of UKF with 1 IMU and 2 GPSes.

(d) Result of EKF with 1 IMU and 1 GPS. (e) Result of UKF with 1 IMU and 1 GPS. (f) Result of UKF with 1 IMU and 2 GPSes.

Figure 4.9: Difference between the ground truth and the estimated map. Here I compare the predefined areas and the plotted path, where the white pixels are the wrongly plotted grid cells. (a), (b), (c) are the friction map results, and (d), (e), (f) are the road grade map results. The UKF shows more accurate positioning than the EKF, and with two GPSes fused in the Kalman filter, fewer grid cells are wrongly located than with just one GPS signal.

After calculating the wrong located grids and all the plotted grids and according to the Eq. 4.19 and Eq. 4.20:


## **4.8 Conclusion**

In this chapter, I proposed an approach to creating a multi-layer map of the construction site in real time, so that the environmental information can be taken into account to further improve vehicle efficiency and safety or to contribute to the path planning of mobile machines on the construction site. Considering phenomena that commonly occur in reality and are mentioned by other researchers, such as noisy sensors and occasional signal loss, the simulation environment was set up in Gazebo with a ROS package based on a real construction site. According to my tests in Gazebo with a series of sensor configurations, the configuration of 1 IMU and 2 GPSes with an encoder, fused by a UKF, has the best overall performance with respect to accuracy and cost. Comparing the estimated maps drawn by the map plotter with the predefined maps, the errors are only 1.0% and 1.2% for the road resistance and the road grade, respectively. Thus, I believe that the developed plotter can be used to save the road condition in real time within a reasonable error range and to offer the terrain and location information to the pathfinding algorithm for multiple working machines.

# **5 Motion Prediction of Manned Working Machines<sup>1</sup>**

The tasks on a working site can be quite diverse. Although there is a strong tendency to introduce autonomous systems into mobile machines, I think it will remain challenging to convert a whole working site into a fully autonomous one within the next decade. Thus, I endow the AI concept with the capability to cooperate with human drivers regarding road occupation. Wheel loaders, for instance, move around while working, unlike excavators, which usually stay in one place. Moreover, the wheel loader is a very commonly used machine on working sites worldwide. Since the motion of a wheel loader is closely related to its working process, in this chapter I propose a series of Multivariate Time Series Classification algorithms, namely CRDNNs, which combine Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Dense Neural Networks (DNN), to precisely recognize the working process and thus the motion of the machines. Compared to existing algorithms, the CRDNN with bi-directional LSTMs has the best accuracy, and the CRDNN with LSTMs has a comparable performance but far fewer training parameters. Based on my dataset, which includes 119 truck loading cycles, my best neural network shows a 98.2% test accuracy. Afterward, I introduce transfer learning and a human-machine communication system to increase the generalization ability of the selected algorithm.

<sup>1</sup> Except for some minor modifications, all the figures, text, and results presented in this chapter have been published in my publications [63, 64]. For both papers, my contributions amount to 100% of the conception and methodology, 90% of the literature review, 80% of the code, 30% of the data collection and labelling, 80% of the results visualization, and 95% of the formulation.

## **5.1 Introduction**

The intrinsic sensors of the machines offer robust signals and information about the machines. By assessing the multivariate time series data, the working process of the machines can be inferred. However, the selection of the variables highly depends on the machine system. To date, there are two mature torque control concepts for hydrostatic drivetrains. The secondary control concept typically has one or more hydraulic accumulators to build up a constant pressure and controls the output torque by adapting the angle of the hydraulic motor. In contrast, the primary torque control concept [120] controls the pressure in a closed circuit by changing the angle of the hydraulic pump based on a feedback system, without accumulators. In this chapter, I use the primary-torque-based concept as a reference to show the performance of the working process detection algorithm, and I focus on algorithms for detecting Y cycles.

## **5.2 Background**

### **5.2.1 Wheel Loader**

The wheel loader is a typical mobile machine used for moving earth. A typical working process is the so-called Y cycle. Concretely, the machine digs into the heap and transfers the soil to a truck. During this process, the machine usually drives along a trajectory resembling the letter Y. Fig. 5.1 illustrates this cycle.

The Y cycles are the most typical working process of wheel loaders. The performance during the Y cycles has a decisive effect on the holistic performance of the mobile machine. Also, the working process detection is used as a vital criterion to predict the intention of the drivers.

Figure 5.1: A typical truck loading process (Y cycle).

### **5.2.2 The Future Mobile Machines Drivetrain System**

Murrenhoff proposed a scheme in [121] to classify the different kinds of control concepts; the details are shown in Fig. 5.2.

Figure 5.2: Segment of control concepts [121].

Currently, most mobile machines use a flow-controlled drivetrain, which sets the vehicle velocity through the volume flow passing the hydraulic motor [122]. The advantage of such a drivetrain solution is the decoupling of the engine speed and the vehicle speed [116]. However, the efficiency of this concept can even be lower than 10% [123] in many applications. Thus, many improvements based on this concept have been proposed [124]. Based on my literature analysis, I find that the research focus in the field of mobile construction machines is shifting to the torque-controlled concept [125, 126, 127]. The motivation for introducing the torque control concept is its higher overall efficiency, its flexible system architecture due to modularization, and its better suitability for hybrid systems. Naturally, different control concepts lead to different system layouts and a correspondingly different selection of internal sensors. Since torque-controlled mobile machines may win the competition in the long term, I focus in this chapter on technologies that can be used on torque-controlled mobile machines, especially the primary torque control introduced by Bosch Rexroth AG in 2018 [120] for hydrostatic mobile machines. Since the measured variables in the primary torque concept can also be interpreted in terms of the secondary control concept, my algorithm can in principle be adapted to secondary-controlled mobile machines with some further work.

One significant advantage of primary torque control is its high efficiency due to the successful introduction of central power management. The basic idea of central power management derives from the requirement that the power made available to the system should be precisely the power consumed by the system. Besides, in case of a power shortage, the power management cuts down the power supply to the devices with a lower priority [120]. To follow this basic idea, every component first computes the energy it requires; the central power manager then gathers this information, compares it with the available power of the power source, and distributes the power to each requester [120]. On hydrostatic mobile machines, there is no constraint between the engine rotation speed and the vehicle speed. Thus, optimization of the engine efficiency is possible. Generally, the engine speed is set as low as possible considering the requested vehicle dynamics.
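The request-and-distribute idea of central power management can be illustrated with a short sketch. It is not the implementation described in [120]: the greedy allocation in priority order, the function name `allocate_power`, the consumer names, and all numbers are assumptions chosen only to show how lower-priority consumers are cut back when power is short.

```python
def allocate_power(requests, available_kw):
    """Distribute the available power among requesters in priority order.

    requests: list of (name, priority, requested_kw); a smaller priority
              number means a more important consumer (assumed convention).
    Returns a dict mapping each name to the granted power in kW.
    """
    grants = {}
    remaining = available_kw
    for name, _priority, requested in sorted(requests, key=lambda r: r[1]):
        granted = min(requested, remaining)  # never grant more than is left
        grants[name] = granted
        remaining -= granted
    return grants

# Hypothetical consumers on a wheel loader requesting 115 kW in total.
requests = [
    ("working_hydraulics", 2, 50.0),
    ("drivetrain", 1, 60.0),
    ("air_conditioning", 3, 5.0),
]
grants = allocate_power(requests, available_kw=100.0)
print(grants)
```

With 100 kW available, the drivetrain is served in full, the working hydraulics receive the remainder, and the lowest-priority consumer is switched off.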

#### **5.2.3 Working Process Detection Algorithms**

Many scholars are interested in utilising machine learning to improve mobile machines regarding efficiency, maintenance, and usability. Especially in the field of working process detection, a series of methods have been proposed. Pohlandt used two simple neural networks to predict and to recognize the desired work process, respectively, on electrical mobile machines [128]. According to his publication, he splits the time series of measured power into many small sliding windows to train the simple neural networks, and he found that neural networks might work for some simple cases [128]. Brinkschulte points out that the prediction accuracy with bagged trees may decrease dramatically when the drivers have different driving skills [129]. Besides machine learning algorithms, Nilsson introduced a method that combines several individually simple techniques, including signal processing, state automaton techniques, and parameter estimation algorithms. Based on 159 cycles, the accuracy is 93% [130]. In 2019, Keller made a case study for an excavator to classify the machine functions using a decision tree with an accuracy of 99.97% without using sliding windows [131]. In addition, Starke shows that the Y cycle can be recognized online with hidden Markov models (HMMs) [132], a method that was widely used before 2012 in word recognition to deal with the temporal variability of text or speech [133]. He also pointed out that truck loading is a high-variance problem and that a simple algorithm should be used owing to the limited computing capacity of the on-board ECU [132].

## **5.3 Problem Statement**

Based on my dataset and previous studies, I summarize the problems faced in this research.

First of all, the detection of arbitrary Y cycles is a high variance problem. Y cycles are different from site to site. The distance between heap and truck can be quite different. Moreover, drivers are also different. Some drivers have many years of driving experience, and thus have become more aggressive. By contrast, some drivers are still novices who correct themselves during some processes. Last but not least, the materials for transport are different. Therefore, a complicated method is needed to handle the high variance.

Another problem is the limited computing capacity of the ECU on mobile machines. Backpropagation consumes more CPU time than forward propagation; thus, online learning usually entails the adoption of a shallow neural network or an efficient, well-known machine learning algorithm such as the Support Vector Machine (SVM). Consequently, a weaker learning ability is to be expected. In light of that, I decided to adopt an off-line method. Since a simple algorithm might not be well suited to high-variance problems, scientists in the field of Natural Language Processing (NLP) usually use algorithms that combine several techniques. In the case of HMMs, Vocal Tract Length Normalization (VTLN) and feature-space Maximum Likelihood Linear Regression (fMLLR) are applied before the HMMs [134]. Neural networks should likewise be combined [135].

## **5.4 Why I Use RNN, LSTM?**

The initial idea to use Long Short-Term Memory (LSTM) is inspired by analogy. Recurrent neural networks have proven to be a powerful tool in the field of NLP in the past years [133]. One significant advance is the introduction of the LSTM [136, 137]. More details about LSTMs can be found in [138]. In many Western languages, subordinate clauses are used extensively in writing, which makes sentences long and highly variable in length. If a sentence is split into words and fed to a simple DNN with a fixed number of input units, the translation performance is usually unsatisfactory. Intuitively, the varying sentence lengths make it difficult to choose the number of input units. A deeper reason is that the simple neural network does not take the order of the words in the sentence into consideration. With limited input units, simple neural networks can only detect the current situation based on a specific past period. If the decisive information occurred a long time ago, the AI must make its decision based on largely useless information, which unsurprisingly causes detection mistakes. To overcome this problem, the LSTM uses update and forget gates to create a shortcut through which vital information can contribute to the current decision. Akin to complicated sentences with clauses, Y cycles can have very different lengths due to the transport process or the different proficiency of the workers. In light of that, the LSTM should be able to solve this similar challenge.

Since my goal is to use an AI algorithm to detect the working process and thus improve the efficiency of mobile machines through regeneration, "future" information can be used to increase the detection accuracy. Generally, the earthmover is first accelerated in the reverse direction and then decelerated after digging into a heap. This duration implies that even if the algorithm does not recognize the Y cycle right when it begins, the regeneration performance is not harmed as long as the Y cycle is detected slightly before the deceleration process. Therefore, I expect bi-directional LSTMs to improve the prediction accuracy.

Similar to HMMs, which may use additional techniques to improve their performance, LSTMs also perform better when combined with CNNs and DNNs [139, 140, 141]. The advantages of the combination of CNNs, RNNs, and DNNs, which I call CRDNN in this chapter, are shown in the next sections.

## **5.5 Data for Deep Learning Algorithm**

As mentioned above, I used a series of neural networks to recognize Y cycles. Put simply, AI is a method for recognizing patterns based on the data with which it has been trained. However, truck loading cycles can differ considerably from each other regarding the traveling distance between heap and truck, the driver's skill level, the materials, and the dimensions of the mobile machines. Training an end-to-end neural network needs a vast dataset, which is still costly to obtain today. Therefore, I propose a multi-step approach to detect truck loading cycles. Instead of predicting the Y cycle directly, I first predict the loading, traveling, and unloading processes, since much less data is required for training. Furthermore, after the neural network outputs its prediction, a correction step can be applied to avoid obvious mistakes. In light of that, I divide the working process into three sub-processes: traveling, loading, and unloading.

As mentioned before, I would like to use a more complicated and therefore smarter neural network, so I adopt the off-line learning method to avoid time-consuming backpropagation on the ECU.

#### **5.5.1 Data Acquisition and Allocation**

Data is the heart of deep learning. I split the dataset into a training and a test dataset of 80% and 20%, respectively. Besides, I deliberately selected three different drivers and carried out the measurements on different days. Some measurements were done on a rainy day so that the density of the material changed. Moreover, I changed the positions of heap and truck to vary the length of the Y cycles. The test drivers were not told what I was going to do, so that they would behave as in their daily operations. In short, I deliberately increased the diversity of my dataset and tried to include more challenging cases.

The data I fed into the neural networks were selected from the typical sensors on primary torque controlled mobile machines. Concretely, these are the pressure difference inside the bucket, the vehicle velocity, the vehicle direction signal of the joystick, the pressure difference inside the closed-circuit drivetrain, and the pressure difference inside the boom. The sample rate is 50 Hz so that the ECU is not overloaded.

In total, I created a dataset with 119 Y cycles: 40 were gathered from an experienced test engineer, 30 from a development engineer with an aggressive driving behavior, 20 while the machine was not well tuned, and 29 from a senior manager who has worked in the field of mobile machines for many decades. None of the data was collected from a complete layman, since I do not think that would make sense. Notice that I allocated the data collected on the insufficiently calibrated machine to the training dataset, since it can improve the robustness of my algorithm without affecting the test accuracy. The measurement dataset is labeled as shown in Fig. 5.3. In the real world, data is often not perfect; that is, some people may mislabel a tiny portion of the data. Therefore, I deliberately labeled some windows as travel although they actually belong to a loading or unloading process, in order to check the robustness of my algorithms.

Figure 5.3: Normalized measurement data with labels. Note that I deliberately mislabeled the data around 450 s to test the model robustness.

Evidently, the pressure inside the bucket and the vehicle velocity show a very strong seasonality, i.e. a periodic pattern. The pressure inside the closed circuit reveals the behavior of a loading process. Moreover, the joystick signal indicates the state changes. Solely based on these variables, an experienced test engineer can tell whether the mobile machine is loading or unloading with almost 100% accuracy, so a deep learning model should be able to take over the job of detecting Y cycles. However, in my measurement data, the Y cycle is not always so regular. For instance, a Y cycle does not always consist of one loading process followed by one unloading process. The driver might think he has loaded too little material, so he comes back after a short reversing process and digs into the heap again. This happens when the driver is not very skilled or makes a mistake. Such cases increase the difficulty of detecting the truck loading process by deep learning.

#### **5.5.2 Data Preparation**

The measurement data were not pre-treated based on human inspection of the dataset before being fed into the neural networks, even though I agree that such pre-processing would surely increase the prediction accuracy. The reason is my concern that the pre-processing may exaggerate the performance of the neural network, since some pre-processing techniques are almost impossible to apply in reality. Therefore, I used only a non-adaptive method to prepare the dataset: a first-order system smooths the data. After that, the dataset is split into small sliding windows. If the window size is 10 samples, the events of the past 2 seconds are taken into consideration, since the sample frequency for creating the sliding windows is 5 Hz. Every window is shifted by one sample relative to the previous one. Sliding windows are used to avoid the influence of data from too long ago.
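The two preparation steps described above, smoothing with a first-order system and cutting overlapping windows that advance by one sample, can be sketched as follows. The discrete first-order lag, the smoothing factor `alpha`, the function names, and the toy signal are assumptions for illustration; the source does not state the filter constant used.

```python
import numpy as np

def first_order_smooth(x, alpha):
    """Discrete first-order lag: y[k] = y[k-1] + alpha * (x[k] - y[k-1])."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    for k in range(1, len(x)):
        y[k] = y[k - 1] + alpha * (x[k] - y[k - 1])
    return y

def sliding_windows(data, window_size):
    """Stack overlapping windows, each shifted by one sample.

    data: (n_samples, n_features)
    returns: (n_samples - window_size + 1, window_size, n_features)
    """
    data = np.asarray(data)
    n_windows = data.shape[0] - window_size + 1
    return np.stack([data[i:i + window_size] for i in range(n_windows)])

# Toy recording: 20 samples of 5 signals at the 5 Hz window sample frequency.
raw = np.arange(100, dtype=float).reshape(20, 5)
smoothed = first_order_smooth(raw, alpha=0.5)   # alpha is an assumed value
windows = sliding_windows(smoothed, window_size=10)
window_seconds = 10 / 5.0                       # 10 samples at 5 Hz cover 2 s
print(windows.shape, window_seconds)
```

Each window keeps the time axis separate from the five signals, which is the input layout a one-dimensional convolution or an LSTM expects.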

The measured data is normalized before training, since I want to prevent any single variable from having too much influence on each gradient descent step. As a result, the shape of the cost function becomes more spherical rather than a strongly elongated ellipse.
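The effect of normalization can be seen in a small example. The sketch below assumes a z-score normalization per signal with statistics taken from the training data only; the source does not state which normalization was used, and the function names and numbers are hypothetical.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-signal mean and standard deviation on the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant signals
    return mean, std

def standardize(x, mean, std):
    """Apply the training statistics to any dataset (training or test)."""
    return (x - mean) / std

# Two signals with very different magnitudes, e.g. a velocity and a pressure.
train = np.array([[0.0, 100.0],
                  [2.0, 300.0],
                  [4.0, 500.0]])
mean, std = fit_standardizer(train)
z = standardize(train, mean, std)
print(z)
```

After scaling, both columns have zero mean and unit variance, so neither signal dominates a gradient step merely because of its unit.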

Besides, the labeled data is converted into one-hot vectors so that all classes are treated as categorical values of equal rank, as shown in Eq. 5.1,

$$Y^{(1)} = \left[ \begin{array}{ccc} 0 & 1 & 0 \end{array} \right] \tag{5.1}$$

which demonstrates that the 1st sample is labelled as loading. In my dataset, 11.62% of all working time belongs to the loading process and 7.86% to the unloading process. My dataset for the truck loading process is thus skewed: even if the system always predicts that the machine is neither loading nor unloading, it reaches a test accuracy of about 80%. To account for this, Confusion Matrices (CMs) and micro-averaged F1 scores should be used to evaluate the performance of my algorithm. In an exploratory training, the training cost without anti-overfitting measures goes down to an extremely low level, while the test cost first goes down and then rises sharply. This indicates that my dataset covers the variance of Y cycles across different cases well.
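The one-hot encoding of Eq. 5.1, the roughly 80% baseline of an always-travel predictor, and the evaluation by confusion matrix and F1 scores can be reproduced with a small sketch. The class order (0 = travel, 1 = loading, 2 = unloading), the helper names, and the ten toy labels are assumptions; only the two time shares are taken from the text.

```python
import numpy as np

def one_hot(labels, num_classes=3):
    """Convert integer class labels into one-hot row vectors."""
    return np.eye(num_classes, dtype=int)[np.asarray(labels)]

def confusion_matrix(y_true, y_pred, num_classes=3):
    """Rows are ground-truth classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_f1(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

def micro_f1(cm):
    tp = np.diag(cm).sum()
    wrong = cm.sum() - tp        # every error is one FP and one FN
    return 2 * tp / (2 * tp + 2 * wrong)

# Share of travel time implied by the text: 100% - 11.62% - 7.86%.
travel_share = 1.0 - 0.1162 - 0.0786

# Toy example: 8 travel, 1 loading, 1 unloading window, always predicted as travel.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10
cm = confusion_matrix(y_true, y_pred)
f1_classes = per_class_f1(cm)
f1_micro = micro_f1(cm)
print(one_hot([1]), travel_share, f1_classes, f1_micro)
```

For single-label classification the micro-averaged F1 equals the overall accuracy, so it is the per-class F1 values and the confusion matrix that expose the missed loading and unloading windows.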

## **5.6 Combined Neural Networks**

To achieve a better detection performance, I take advantage of combined neural networks and thus selected CRDNNs as my tool. In this section, I present the combination of CNNs, RNNs, and DNNs. Each of them has its limitations and advantages, so I believe that combined neural networks can compensate for each other's disadvantages. For example, LSTMs are good at temporal modelling, but they cannot be stacked into a large number of hidden layers.

As shown in Fig. 5.4, I use a one-dimensional convolutional layer (conv1D) at the beginning to provide better features for the LSTMs. It is followed by two DNNs to reduce the dimension of the CNN output. Further, I add two LSTMs, since the LSTM is considered an excellent tool for many time series applications. At the end, two DNNs are used to add nonlinear hidden layers and thus increase the prediction performance through a deeper mapping. The core of LSTMs is the update and forget gates, which handle the long- and short-term data. Eq. 5.2 demonstrates the idea.

$$\begin{cases} \tilde{c}^{\langle t\rangle} = \tanh\left(W_c[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_c\right) \\ \Gamma_u = \sigma\left(W_u[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_u\right) \\ \Gamma_f = \sigma\left(W_f[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_f\right) \\ \Gamma_o = \sigma\left(W_o[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_o\right) \\ c^{\langle t\rangle} = \Gamma_u * \tilde{c}^{\langle t\rangle} + \Gamma_f * c^{\langle t-1\rangle} \\ a^{\langle t\rangle} = \Gamma_o * \tanh c^{\langle t\rangle} \end{cases} \tag{5.2}$$
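A single time step of Eq. 5.2 can be written out directly in NumPy, which makes the role of the gates explicit. This is a didactic sketch and not the Keras implementation used for the experiments; the dictionary keys, the hidden size of 4, and the random weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM step following Eq. 5.2.

    x_t:    input at time t, shape (n_x,)
    a_prev: previous activation, shape (n_a,)
    c_prev: previous cell state, shape (n_a,)
    params: dict with weights W* of shape (n_a, n_a + n_x) and biases b* of shape (n_a,)
    """
    concat = np.concatenate([a_prev, x_t])                    # [a<t-1>, x<t>]
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])   # candidate cell state
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])   # update gate
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])   # forget gate
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev                # element-wise products
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

# Run one window of 10 time steps with 5 input signals through a 4-unit cell.
n_x, n_a = 5, 4
rng = np.random.default_rng(0)
params = {}
for gate in ("c", "u", "f", "o"):
    params["W" + gate] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
    params["b" + gate] = np.zeros(n_a)

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.normal(size=(10, n_x)):
    a, c = lstm_step(x_t, a, c, params)
print(a, c)
```

The forget gate decides how much of the old cell state survives, which is the shortcut that lets information from the beginning of a long Y cycle reach the current decision.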

Generally, the learning ability increases as the number of hidden layers increases. However, more hidden layers result in many more training parameters, which may be a heavy load for the vehicle ECU. In this section, I evaluate the CRDNNs' test accuracy with respect to the number of hidden layers, the units per hidden layer, and the time window size.

Theoretically, LSTMs can work without a sliding window. However, I need to avoid the effect of data recorded before a disruptive event, such as the driver stopping the vehicle to rest for a while, which would affect the prediction performance. Therefore, I also use sliding windows for the CRDNNs. The window size can affect the performance of the neural networks, since a larger window size allows them to consider a longer period when making the decision.

Since I want to know which model has the best test accuracy and which one has a good test accuracy with fewer training parameters, I compare the performance of different architectures with different window sizes. I monitor the training and test costs over the epochs and stop the optimization process when there is a noticeable tendency for the test cost to increase. The training and test costs versus epochs of different neural networks are shown in Fig. 5.5. For example, for the CRDNN with 2 LSTMs that is fed data with a window size of 9, I stop the training at epoch 60.
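The stopping rule described above can be approximated by a simple patience criterion. In the text the stopping epoch is chosen by inspecting the curves, so the automatic rule, the function name `early_stopping_epoch`, the `patience` value, and the cost values below are assumptions used only to illustrate the idea.

```python
def early_stopping_epoch(test_costs, patience=3):
    """Return the epoch with the lowest test cost seen before the cost
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_cost, waited = 0, float("inf"), 0
    for epoch, cost in enumerate(test_costs):
        if cost < best_cost:
            best_epoch, best_cost, waited = epoch, cost, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the test cost shows a clear upward tendency
    return best_epoch

# Hypothetical test cost curve: improves until epoch 3, then rises.
test_costs = [1.00, 0.60, 0.40, 0.35, 0.36, 0.40, 0.47, 0.30]
stop_epoch = early_stopping_epoch(test_costs, patience=3)
print(stop_epoch)
```

The late improvement at the last epoch is never reached, because training is stopped after three epochs without improvement.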

Figure 5.5: Training- and test costs versus epochs.

To find suitable hyper-parameters of the neural networks, I analyze the weights of each layer. Notably, while people recognize the working process mainly by watching the pressure inside the bucket, the CRDNNs do not pay particular attention to this variable, since the absolute value of its weights is not considerably larger than that of the others.

Figure 5.6: Confusion matrices of CRDNNs.

As shown in Fig. 5.5, the cost goes down to a certain level and then fluctuates if regularization and dropout are used. Notice that the cost with anti-overfitting methods is higher because of the regularization term; this does not mean that the accuracy is worse than without anti-overfitting methods. I also add class weights to the cost function. The weights help to avoid particular kinds of recognition errors. For instance, if the weight on loading is higher than the weight on the traveling process, the optimization pays more attention to avoiding errors on loading than on the traveling process. Formally, see Eq. 5.3.

$$\begin{aligned} J(\theta) = {} & \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log\left( (h_\theta(x^{(i)}))_k \right) - (1 - y_k^{(i)}) \log\left( 1 - (h_\theta(x^{(i)}))_k \right) \right] W_k \\ & + \frac{\lambda}{2m} \left[ \sum_{j=1}^{32} \sum_{k=1}^{32} (\theta_{j,k}^{(1)})^2 + \sum_{j=1}^{3} \sum_{k=1}^{32} (\theta_{j,k}^{(2)})^2 \right] \end{aligned} \tag{5.3}$$


Compared to the cost function without regularization, the regularization term increases the total value of the cost function. W<sub>k</sub> denotes the weight of state k. In my case, I recommend setting the weights as

$$
\tilde{W} = \begin{bmatrix} 1 & 4 & 7 \end{bmatrix}^T \tag{5.4}
$$
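The weighted and regularized cost of Eq. 5.3 with the weights of Eq. 5.4 can be evaluated numerically as follows. The sketch assumes that the three weights refer to travelling, loading, and unloading in this order, that the network outputs lie in (0, 1), and that the function name and the toy outputs are hypothetical.

```python
import numpy as np

def weighted_cost(y, h, class_weights, thetas=(), lam=0.0, eps=1e-12):
    """Weighted cross-entropy with L2 regularization in the form of Eq. 5.3.

    y: (m, K) one-hot labels; h: (m, K) network outputs in (0, 1)
    class_weights: (K,) weights W_k; thetas: weight matrices to regularize
    """
    m = y.shape[0]
    h = np.clip(h, eps, 1.0 - eps)                     # avoid log(0)
    per_class = -y * np.log(h) - (1 - y) * np.log(1 - h)
    data_term = (per_class * class_weights).sum() / m
    reg_term = lam / (2 * m) * sum((theta ** 2).sum() for theta in thetas)
    return data_term + reg_term

W = np.array([1.0, 4.0, 7.0])  # Eq. 5.4, assumed order: travel, loading, unloading

# The same poor confidence of 0.2 on the true class, once for a travel
# window and once for an unloading window.
cost_travel = weighted_cost(np.array([[1, 0, 0]]), np.array([[0.2, 0.4, 0.4]]), W)
cost_unload = weighted_cost(np.array([[0, 0, 1]]), np.array([[0.4, 0.4, 0.2]]), W)

# Adding the L2 term for a small, hypothetical weight matrix raises the cost.
cost_reg = weighted_cost(np.array([[1, 0, 0]]), np.array([[0.2, 0.4, 0.4]]), W,
                         thetas=[np.ones((2, 2))], lam=1.0)
print(cost_travel, cost_unload, cost_reg)
```

An equally uncertain prediction is penalized more heavily on an unloading window than on a travel window, which steers the optimization towards the rarer classes.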

Since the rectified linear unit (ReLU) has a constant gradient for positive inputs, I use ReLU as the activation function so that the computational effort is reduced and the network converges, i.e. learns, much faster. The hyper-parameters I used are shown in Tab. 5.1.


Table 5.1: Parameters of CRDNN

Generally, I use the micro-averaged F1 score to evaluate and select the most suitable architecture. Nonetheless, since I am going to implement an operation strategy based on the learning algorithm later, and the F1 score alone does not indicate whether a result is easy to correct, I also use CMs to evaluate the results, see Fig. 5.6, where the abscissa indicates the predicted value and the ordinate indicates the ground truth label. e<sub>0</sub>, e<sub>1</sub>, and e<sub>2</sub> denote the travelling, loading, and unloading processes, respectively. The F1 score is used as a subordinate criterion to select the solution with the better overall performance. The CRDNN with two bidirectional LSTMs has the best performance, in line with my assumption, with an overall accuracy of 98.5%<sup>2</sup>. Compared to simple DNNs, the CRDNN achieves an improvement of about 2.4%. Bidirectional LSTMs make the decision based on a longer time period and can consider the data after the event, which explains their better accuracy. The improvement compared to the DNN arises because LSTMs are good at handling long-term dependencies, so that I can feed a larger window size into the CRDNNs. Another promising architecture is the CRDNN with two LSTMs, which is only slightly worse than the one with bidirectional LSTMs but has far fewer training parameters. An additional advantage of CRDNNs is that most of them have fewer training parameters than simple networks despite their more complicated architecture. Compared to a simple neural network with two hidden layers and 128 units per layer, which has more than 30,000 training parameters, the CRDNN with two LSTM layers has only 16,295, resulting in a much faster on-board calculation.
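The parameter comparison can be made tangible with the standard counting formulas for dense, LSTM, and one-dimensional convolutional layers. The sketch does not reproduce the 16,295 parameters of the CRDNN, because the layer sizes are listed in Tab. 5.1 and not in the text; the sizes used below (32 units, kernel size 3) are assumptions.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, n_units):
    """Four weight blocks (candidate, update, forget, output gate), each
    acting on the concatenation [a, x], plus one bias vector per block."""
    return 4 * (n_units * (n_units + n_in) + n_units)

def conv1d_params(kernel_size, channels_in, filters):
    """Kernel weights plus one bias per filter of a 1-D convolution."""
    return kernel_size * channels_in * filters + filters

# The connection between two dense hidden layers of 128 units each.
hidden_to_hidden = dense_params(128, 128)

# Hypothetical CRDNN building blocks: 32 LSTM units fed with 32 features,
# and a convolution over the 5 measured signals with 32 filters.
lstm_block = lstm_params(32, 32)
conv_block = conv1d_params(3, 5, 32)
print(hidden_to_hidden, lstm_block, conv_block)
```

The single 128-by-128 connection of the simple network already needs 16,512 parameters, which is more than the complete CRDNN with two LSTM layers reported in the text.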

## **5.7 Evaluation of the Methods**

As shown in the last section, the test accuracy saturates at about 98%. To further improve the prediction accuracy, I examine where my algorithm makes mistakes.

As shown in Fig. 5.7, I illustrate the ground truth and the mistakes made by the CRDNN with two bidirectional LSTMs. The blue line denotes the ground truth label, and the colored points mark the windows where the CRDNN predicts a result different from the ground truth label, i.e. where it makes a mistake. The mistakes mainly occur when the machine changes from one state to another. However, I must point out that these are states for which it is also controversial for humans to say whether the state should be loading, unloading, or travel, since the features are vague in this region. When I further plot all the falsely recognized time windows, I find that almost all of the mistakes occur when the state is really fuzzy.

Figure 5.7: Ground truth and prediction mistakes. As a reminder, the states 0, 1, and 2 represent the travelling, loading, and unloading processes, respectively. The blue line denotes the ground truth. The colored points show the predictions that differ from the ground truth. As can be seen, most of them are located where the state changes, which indicates that a further accuracy improvement may not be meaningful. Since the sample frequency is 5 Hz, the windows around the 2,250th correspond to the data at about 450 s in the original measurement shown in Fig. 5.3. The deep learning model is robust even though there are some mistakes in the ground truth dataset.

<sup>2</sup> Note that the loading process accounts for 11.62% of the time in the dataset, the unloading process for 7.86%, and the travelling process for the rest. Since the travelling process is the vast majority, the overall accuracy is close to the accuracy of the travelling process prediction.

One exception is the region around the 2,250th window in Fig. 5.7, corresponding to the measurement data at 450 seconds, see Fig. 5.3. As mentioned before, these are the windows that I deliberately mislabeled as travel. The CRDNN with two bidirectional LSTMs nevertheless identified that the process is actually an unloading process rather than a travel process, which shows that the CRDNN performs robustly even if a small amount of data is mislabeled. Therefore, I believe that about 98% is the best achievable number, since different engineers define the ground truth differently, each with plausible reasons. Moreover, although a tiny amount of mislabeling might lower the measured test accuracy, it does not affect the actual prediction performance of the CRDNN.
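The observation that mistakes cluster around state changes, and the conversion from window index to measurement time, can be checked with a short sketch. The tolerance of two windows, the function names, and the toy label sequence are assumptions; only the 5 Hz window frequency and the 2,250th window are taken from the text.

```python
import numpy as np

def share_of_errors_near_transitions(y_true, y_pred, tolerance=2):
    """Fraction of misclassified windows lying within `tolerance` windows
    of a change of the ground-truth state."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    errors = np.flatnonzero(y_true != y_pred)
    # index of the first window of every new state
    transitions = np.flatnonzero(np.diff(y_true) != 0) + 1
    if errors.size == 0:
        return float("nan")
    if transitions.size == 0:
        return 0.0
    distance = np.abs(errors[:, None] - transitions[None, :]).min(axis=1)
    return float((distance <= tolerance).mean())

def window_to_seconds(window_index, window_rate_hz=5.0):
    """Map a window index to the time in the original measurement."""
    return window_index / window_rate_hz

# Toy sequence: travel, loading, travel, unloading (states 0, 1, 0, 2).
y_true = np.array([0] * 10 + [1] * 5 + [0] * 10 + [2] * 5)
y_pred = y_true.copy()
y_pred[9] = 1    # one window too early at the start of loading
y_pred[15] = 1   # one window too late at the end of loading
y_pred[20] = 2   # an isolated mistake in the middle of travelling
share = share_of_errors_near_transitions(y_true, y_pred, tolerance=2)
print(share, window_to_seconds(2250))
```

Two of the three toy mistakes lie directly at a state change, while the 2,250th window maps to 450 s, the region of the deliberately mislabeled data.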

## **5.8 Fast CRDNN**

The CRDNN is a combined neural network that can accurately detect the truck loading cycles of torque-based mobile working machines. On the one hand, it is a robust offline learning algorithm, which makes it more accurate and much quicker than the previous methods. On the other hand, its accuracy cannot always be guaranteed because of the diversity of the mobile machines industry and the nature of the offline method. To address this problem, I utilize transfer learning and IoT technology. Concretely, the CRDNN is first trained on a computer and then saved in the on-board ECU. In case the pre-trained CRDNN is not suitable for a new machine, the operator can label some new data with my App, which is connected to the on-board ECU of that machine through Bluetooth. With the newly labeled data, I can further train the pre-trained CRDNN directly on the ECU without overloading it, since transfer learning requires less computational effort than training the network from scratch. In this chapter, I verify this idea by field experiment and show that, with the help of transfer learning and IoT technology, the CRDNN remains competent even if the data of the new machine follows a different distribution. I also compare the performance of other state-of-the-art multivariate time series algorithms in predicting the working state of mobile machines, which shows that the CRDNNs are still the most suitable solution. As a by-product, I build a human-machine communication system to label the dataset, which can be operated by engineers without knowledge of AI. This paragraph serves as an abstract to help the reader understand the context; the details are described in the following sections.
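The principle of further training a pre-trained network with a small amount of newly labeled data can be sketched without any deep learning framework. The example below makes strong simplifying assumptions that are not taken from the source: the earlier layers are treated as frozen and represented only by their output features, solely the final softmax layer is retrained by plain gradient descent, and all data are synthetic.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y_onehot):
    return float(-(y_onehot * np.log(probs + 1e-12)).sum(axis=1).mean())

def fine_tune_head(features, y_onehot, W, b, lr=0.1, steps=200):
    """Retrain only the output layer; `features` are the outputs of the
    frozen pre-trained layers for the newly labeled windows."""
    W, b = W.copy(), b.copy()
    m = features.shape[0]
    for _ in range(steps):
        probs = softmax(features @ W + b)
        grad = (probs - y_onehot) / m       # gradient of the mean cross-entropy
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

rng = np.random.default_rng(0)
features = rng.normal(size=(60, 8))            # synthetic frozen-layer outputs
W_new_machine = rng.normal(size=(8, 3))        # synthetic behaviour of the new machine
labels = (features @ W_new_machine).argmax(axis=1)
y_onehot = np.eye(3)[labels]

W_pre = rng.normal(scale=0.1, size=(8, 3))     # stands in for the pre-trained head
b_pre = np.zeros(3)
loss_before = cross_entropy(softmax(features @ W_pre + b_pre), y_onehot)
W_ft, b_ft = fine_tune_head(features, y_onehot, W_pre, b_pre)
loss_after = cross_entropy(softmax(features @ W_ft + b_ft), y_onehot)
print(loss_before, loss_after)
```

Because only one small layer is updated, each training step costs a fraction of a full backpropagation pass, which is the property that makes on-board adaptation feasible.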

#### **5.8.1 What is CRDNN?**

As mentioned above, the CRDNN is a neural network that combines CNN, RNN, and DNN. The combination brings together the advantages of the different kinds of neural networks [139]. The pressure inside the bucket (pbu), the vehicle velocity (vveh), the vehicle direction signal of the joystick (ujs), the pressure inside the closed-circuit drivetrain (pcc), and the pressure inside the boom (pbo) are collected while the wheel loaders are working in Y cycles. I labeled the data with the corresponding working state: traveling (e0), loading (e1), and unloading (e2). I then trained my neural network on the computer with these ground truth data. In order to find the best model for the task, I explored many different kinds of networks, such as CNNs, RNNs, DNNs, and their combinations. Among these neural networks, the combined neural network CRDNN with two LSTM layers achieves an excellent test accuracy with relatively few training parameters<sup>3</sup>. Moreover, the robustness of this model against a small amount of mislabelled data is another reason for the final selection. I saved the trained CRDNN in the on-board ECU, where it can rapidly identify the working state with high precision and recall. The model is built with the Keras API in TensorFlow [142]. A more detailed description of the CRDNN and of how the dataset [143] was built can be found in my previous study [63].

#### **5.8.2 Motivation of Fast CRDNN**

In the previous study [63], the CRDNN shows excellent performance in detecting the Y cycles of primary-torque-based mobile machines. To date, I believe that the CRDNN is a promising method to solve the problem. Firstly, it is an offline approach that can be an order of magnitude faster than online learning methods. Also, it achieves better performance on the challenging dataset by taking the time-series signal sequence into account. However, due to the diversity of mobile machines and driver behaviors, the prediction accuracy is not always satisfying even when the CRDNN is used. The performance of the CRDNN decreases when it faces measured data from a driver with totally unseen behaviors, which means that the distribution of the data gathered from the new machines and drivers differs from that of the dataset used to train the CRDNN. The reasons are apparent. First and foremost, the CRDNN is an offline learning method that cannot automatically adapt to new tasks after it has been trained. Also, gathering data for every scenario for the initial training is still challenging and economically infeasible. Therefore, in this chapter, I utilize transfer learning and IoT technology to solve the problem. The pre-trained CRDNN is further trained if the machines or drivers have totally different features, so that the recognition system can then reach the expected performance. Establishing the communication interface between humans and machines plays a vital role in this approach. Therefore, this communication interface is also introduced in Section 5.13.

<sup>3</sup> The graphical description is given in Fig. 5.4.

Figure 5.8: Graphical abstract. Here the core data is a large dataset that contains the data of 119 Y cycles from many wheel loaders. This core dataset is used to train the base network. Thanks to this base network, I can then use transfer learning to adapt its weights with the new data, which improves the generalization ability easily and quickly. The method is proposed to solve a problem pointed out by many machine learning researchers: the distribution of the source data may differ from that of the target data, since collecting a comprehensive dataset is, in many cases, impossible.

The main contributions of the following paragraphs in this chapter can be summarized in the following points:


The rest of this chapter is organized as follows. Section 5.9 and Section 5.10 briefly introduce the prerequisites and background knowledge in the fields of time series classification and IoT that are needed to understand this chapter, since my readers might come from either of these fields. Next, the existing problems and the proposed solutions are illustrated in Section 5.11. Then, the reasons why I adopt these solutions are provided in Section 5.12. After that, in Section 5.13, I describe the connection system between the human and the mobile machines. In Section 5.14, I describe the measurement setup. In Section 5.16 and Section 5.17, I compare the variations of CRDNNs with the SOTA TSC solution and evaluate the performance of different transfer learning methods. Finally, Section 5.18 gives the conclusions of this chapter.

## **5.9 Long Short Term Memory Fully Convolutional Network: a SOTA Solution for TSC Tasks**

Long Short Term Memory Fully Convolutional Network (LSTM-FCN) is designed for classifying univariate time series [144]. In order to apply this network to the multivariate time series classification problem, Karim extended the Squeeze-And-Excite (SAE) block to the case of 1D sequence models and augmented the fully convolutional blocks of the LSTM-FCN model to improve classification accuracy [5]. The network architecture is shown in Fig. 5.9.

Figure 5.9: LSTM-FCN with squeeze-and-excite block [5].

Fully Convolutional Networks (FCN) have proven to be an effective learning model for TSC problems [145]. The FCN block, which comprises three temporal convolutions, is typically used as a feature extractor. Global average pooling [146] is used to reduce the number of parameters in the model before classification. The SAE block is added after the FCN block and adaptively recalibrates the input feature maps [5].

This architecture has been tested on 35 benchmark datasets for TSC, and it outperforms the other SOTA models on at least 28 datasets [5]. Thus, I would like to compare the CRDNN with this algorithm for the task of detecting mobile machines' Y cycles.
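To make the recalibration performed by the squeeze-and-excite block more concrete, the following is a minimal sketch in plain Python. It is not the implementation from [5]: the function name `squeeze_excite_1d`, the weight matrices `w_reduce` and `w_expand`, and the omission of bias terms are illustrative assumptions.

```python
import math


def squeeze_excite_1d(feature_maps, w_reduce, w_expand):
    """Recalibrate 1D feature maps channel by channel.

    feature_maps: list of C channels, each a list of T values.
    w_reduce:     (C // r) x C weight matrix (hypothetical values).
    w_expand:     C x (C // r) weight matrix (hypothetical values).
    """
    # Squeeze: global average pooling over the time axis of each channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_maps]
    # Excite, step 1: bottleneck dense layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    # Excite, step 2: dense layer followed by a sigmoid gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale every channel by its gate value.
    return [[g * v for v in channel] for g, channel in zip(gates, feature_maps)]


# Two channels with two time steps each; all-zero expansion weights give
# a gate of sigmoid(0) = 0.5 for every channel.
example = squeeze_excite_1d(
    [[1.0, 3.0], [2.0, 2.0]],
    w_reduce=[[1.0, 1.0]],
    w_expand=[[0.0], [0.0]],
)
```

The gate of each channel depends on the pooled statistics of all channels, which is what allows the block to emphasize informative feature maps and suppress less useful ones.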

## **5.10 Wireless Human Machine Communication**

To achieve the smart working site, effective communication among mobile machines is an inevitable and vital step. Since mobile machines are very likely to work at places outside the coverage of a base station, I utilized an ad-hoc network as the first version for the fleet management of the mobile machines [40]. Although a real-time communication system is proposed in that chapter, bidirectional communication between humans and mobile construction machines is still a gap. Recently, many other scientists have also emphasized the value of setting up a management system between operators and machines [147]. However, they did not consider the rapid development of new technology on smartphones and consequently did not develop core functions on the smartphone. According to the research of Ignatov, the capability of the system on a chip (SoC) in cell phones grows extremely fast and reaches almost 40% of the speed of a GeForce GTX 1060 in terms of processing images [148]. Hence, I would like to build up a connection between cell phones and mobile construction machines to take advantage of the development of the cell phone SoC industry. The manufacturers of the top SoCs as of April 2020, namely the A13 from Apple Inc., the Snapdragon 865 5G from Qualcomm, the Kirin 990 5G from Huawei, and the Exynos 990 5G from Samsung, claim that their SoCs are about 20% faster than their previous generation released the year before. Also, the newest SoCs are equipped with GPUs to enhance their capability to deal with artificial intelligence tasks. All of them have Bluetooth 5.0 modules that can easily connect to the onboard ECU of a mobile construction machine. Apparently, the computational performance of SoCs develops much faster than that of the onboard ECU.

## **5.11 Problem Statement and Brief Description of the Solution**

The first version of the CRDNNs can easily achieve a predictive accuracy of about 98% on the dataset of 119 Y cycles, which reaches human-level performance. However, when I deliberately change the equipment of those mobile machines, especially the shovel, and then test the CRDNNs, the performance degrades to an unacceptable level. Fig. 5.10 illustrates the performance of the CRDNN when it faces measurement data from a driver with totally unseen behaviors and a changed implement. The reason is that the training data and the test data differ in both the marginal distribution and the conditional distribution.

Figure 5.10: The confusion matrix of the CRDNN on new data. The $e_i$ are the ground truth states and the $\hat{e}_i$ are the predicted states. As defined in my previous work, $e_1$, $e_2$, and $e_3$ denote the states traveling, loading, and unloading, respectively.

In addition, since mobile machines are rented for construction tasks and charged by time, the robustness of the machines and of the algorithms running on them is non-negotiable for the contractors. Thus, either an approach that always guarantees the performance of the algorithm without adjustment or an approach that requires only rapid and easy calibration is needed as a complementary solution. Even worse, OEMs are reluctant to share their data with each other, resulting in a lack of training data for all of them. Based on these facts and challenges, I select the approach of offline learning with online adaptation. Concretely, instead of sharing the real measured data, transfer learning allows the OEMs to further train the pre-trained base neural network with a small new dataset and thus achieve an effect similar to gaining a large amount of data and training the neural network on it.

As is well known, data plays a critical role in deep learning. A large and highly diverse dataset improves the capability of machine learning methods. Also, excellent performance of a neural network in practice requires that the training data and the test data share the same distribution and features. However, in the real world, there are many different kinds of construction machines and workplaces, which may change the data distribution. Since collecting a dataset from all kinds of construction machines is almost impossible, I adopt transfer learning to bring the distribution of the training data in line with that of the test data. Since there must be some similarities between the data collected from the previous wheel loaders and the new machines, it should be possible to fine-tune the network with a small amount of data labeled on the working site, and only a few training steps should be needed to achieve satisfying prediction results<sup>4</sup>. Thus, the fine-tuning is not computationally expensive and can be carried out directly on the onboard ECU or a smartphone. Note that whether the network should be trained on the onboard ECU or on the SoC in the cell phone depends on their capabilities and on the bandwidth of the connection. At present, I recommend further training the CRDNN on the onboard ECU, since far more data would have to be transmitted from the mobile construction machine to the cell phone than in the reverse direction. However, the approach introduced in this chapter can easily be adapted to train the CRDNN on the cell phone once data transmission is no longer a problem.

<sup>4</sup> Strictly speaking, this is still a hypothesis at this point. However, it will be verified later in this chapter.

## **5.12 Why Transfer-Learning Based Supervised Learning?**

Traditional machine learning performs well when the training data and the test data have the same input feature space and the same data distribution. When the data distribution differs between the training data and the test data, the results of a predictive learner are likely to degrade [149, 150, 151]. In certain scenarios, obtaining training data that match the feature space and the predicted data distribution characteristics of the test data can be difficult and expensive. Therefore, there is a need to create a high-performance learner for a target domain that is trained from a related source domain. This is the motivation for transfer learning [152]. Transfer learning is used to improve a learner in one domain by transferring information from a related domain [153, 154].

Since transfer learning is a rapidly developing subject, its terminology and definitions are not yet consistent. In this chapter, I use the mathematical definition from Pan for the further discussion, who defined $D_s = (X_s, P(X_s))$ as the source domain, $D_t = (X_t, P(X_t))$ as the target domain, $T_s = (X_s, f_s(\cdot))$ as the source task, and $T_t = (X_t, f_t(\cdot))$ as the target task. Transfer learning aims to enhance the learning of the target predictive function $f_t(\cdot)$ in $D_t$ using the knowledge in $D_s$ and $T_s$, where $D_s \neq D_t$ or $T_s \neq T_t$ [155].

In the past decade, transfer learning has been successfully implemented in the fields of image recognition [156, 157] and natural language processing [158]. In contrast, scientists in the field of TSC believe that much remains to be proven or improved [159]. It is only recently that deep learning was proven to work well for some TSC tasks [160]. However, unlike in image recognition [161], the lack of a sizeable general-purpose dataset limits the development of transfer learning in TSC. Another well-known problem in implementing transfer learning for TSC tasks is negative transfer. As we know, someone who is good at handball can learn to play basketball faster than others who have never played handball before. The reason is apparent: the knowledge needed to play handball well and the knowledge needed to play basketball well are similar. However, people usually keep a negative evaluation of those who gave them a bad first impression ($D_s$), no matter how these people change ($D_t$). In the latter example, the first knowledge ($D_s$) does not contribute to the correct prediction ($f_t(\cdot)$) and indeed has an adverse effect. This is a negative transfer. Negative transfer and the transferability of features are still very active research topics [162]. By testing the performance of the FCN algorithm [145] with transfer learning and from scratch on a series of datasets, Fawaz has revealed that transfer learning can either improve or degrade the model prediction depending on the source dataset ($D_s$) [163]. To the best of the author's knowledge, the consensus is that transferring models between similar datasets improves the performance of $f_t(\cdot)$. In contrast, Rosenstein empirically shows that if two tasks are too dissimilar, brute-force transfer may hurt the performance on the target task [164]. Mahmud proved some theoretical bounds by analyzing the case of transfer learning using Kolmogorov complexity [165]. Furthermore, some previous works analyze the relatedness among tasks by using clustering techniques, which provides a guideline on how to automatically avoid negative transfer [166, 167]. Keogh shows that dynamic time warping is a robust distance measure for time series, which can thus be used to evaluate the similarity of datasets [168]. Based on the literature review in the field of transfer learning, I conclude that the more similar $D_s$ and $D_t$ are, the better transfer learning can perform.
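As an illustration of how dynamic time warping (DTW) can quantify the similarity of two time series, the following is a minimal sketch of the classic dynamic-programming formulation. It is not the procedure used in [168]: the function name `dtw_distance` and the choice of the absolute difference as the local cost for univariate series are assumptions made for this example.

```python
def dtw_distance(series_a, series_b):
    """Return the DTW distance between two univariate sequences."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning the first
    # i samples of series_a with the first j samples of series_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat the sample of series_b
                cost[i][j - 1],      # repeat the sample of series_a
                cost[i - 1][j - 1],  # advance in both series
            )
    return cost[n][m]


# The same step shifted in time has zero DTW distance.
shifted = dtw_distance([0, 0, 1], [0, 1, 1])   # 0.0
# A constant offset of one cannot be warped away completely.
offset = dtw_distance([1, 2, 3], [2, 3, 4])    # 2.0
```

A small distance between signals recorded on the previous and the new machines would suggest that the two datasets are similar, which, following the conclusion above, is the situation in which transfer learning is expected to work well.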

There are different strategies and implementations for solving a transfer learning problem. The majority of the homogeneous transfer learning solutions employ one of three general strategies: correcting for the marginal distribution difference $P(X_s) \neq P(X_t)$, correcting for the conditional distribution difference $P(Y_s \mid X_s) \neq P(Y_t \mid X_t)$, or correcting for both [169].

Similar use cases of transfer learning for TSC can be found in many previous studies. For example, Hu first trained a model on the historical wind-speed data of an old farm and then fine-tuned it using the data of a new farm [170]. In addition, Peng proposed a transfer-learning-based approach to establish an anomaly detection model for dangerous actions in aircraft test flights [171]. Ma proposed a transfer-learning-based bidirectional long short-term memory model to predict air quality [172]. The success of these implementations of transfer learning on TSC tasks encourages me to follow this concept.

In my transfer learning task, the data I used to pre-train the base network from scratch form the source domain ($D_s$), while the data I collect from the new machines form the target domain ($D_t$). The solution to this problem is to correct both the marginal and the conditional distribution difference. My approach can be referred to as a parameter-transfer approach, which assumes that the source tasks and the target tasks share some parameters or prior distributions of the hyper-parameters of the models. Concretely, I recompute the trainable parameters of the neural network while keeping the architecture of the base network the same.

Another potential approach that could be used to detect Y cycles is semi-supervised sequence learning, which leverages unlabeled data to further improve the predictive accuracy [173]. However, it is quite difficult for semi-supervised learning to outperform supervised learning [174]. This method is usually adopted for tasks with private data, where labeling the data is prohibited [175]. In the case of detecting Y cycles, obtaining new data is only a technical problem, and the data should become much easier to get in the era of IoT; thus, I use supervised learning instead. To realize transfer-learning-based supervised learning, I have designed a connection system between the mobile machines and humans using a smartphone.

## **5.13 Connection System Design**

### **5.13.1 Choice of Wireless Communication Technology**

There are four common short-range wireless communication technologies in the field of IoT, namely Near-Field Communication (NFC), Radio-Frequency Identification (RFID), Bluetooth, and WiFi. A comparison of their main specifications is shown in Tab. 5.2.

In order to enhance the generalization capability of the CRDNN, I need newly labeled data to further train the pre-trained base network. The new data are labeled through the mobile app, which connects to the ECU via Bluetooth. With the newly labeled data, the network is retrained on the ECU, and the accuracy of the retrained network is shown in the app. When the test accuracy reaches the expectation, the machine can be put into use.

Considering that most machine operators are not specialists in deep learning, I design the interface to be as intuitive as possible. I find that only two tasks must be done manually: labeling the data and checking the confusion matrix. The other steps are done automatically by either the app or the ECU.

Each of these technologies has its pros and cons and suits different scenarios. NFC can be easily used for transactions, but not for on-site training due to its limited range of approximately 10 cm. RFID technology provides a reliable and efficient way to transmit the identity of an object [176], so it is widely used in systems such as E-ZPass [177]. However, RFID only supports one-way transmission and is therefore not a solution for my use case. Compared to WiFi, Bluetooth has a lower energy consumption and a more straightforward hardware implementation [178]. Therefore, I select Bluetooth for the on-site training. To date, the latest version of Bluetooth is Bluetooth 5.0, which was introduced by the Bluetooth Special Interest Group (SIG). This version offers significant enhancements compared to the previous specifications: a broader range of up to 200 m, a faster speed of up to 2 Mbps, and higher robustness to interference [179].




### **5.13.2 User Interface of the System**

The on-site training system is presented in Fig. 5.11. In the following, I describe the process of fine-tuning the model on the onboard ECU. The system consists of a smartphone for labeling data manually and the mobile construction machine, which is equipped with a Bluetooth Low Energy (BLE) transceiver chip for communicating with the mobile device. The construction machine operator installs my "Smart Working Site" app, which is shown in Fig. 5.11. The app provides four views, namely "Connect machines", "Label the Data", "Advanced Settings", and "Test Accuracy". At the beginning of the on-site training, the machine operator activates the Bluetooth of the smartphone and pairs it with the construction machine, provided that the construction machine is within the Bluetooth coverage of the smartphone. Then, the operator observes and records the construction machine's actions as the driver starts the construction work. Once an action is labeled, the machine's working state is immediately transmitted to the on-board ECU. This time series of labels indicates the current action of the machine and serves as the ground truth for the transfer training of the network. Those who are familiar with neural networks can tune the hyperparameters and choose different learning algorithms to retrain the network in the "Advanced Settings" tab. However, the model I recommend in this chapter can fix most of the problems; thus, I do not suggest using the advanced functions on the smartphone unless the operators are extremely confident. The hyperparameter "epochs" indicates the number of passes in which all the training data are fed to the network. The other setting, "weights", sets the priority of each working state to be correctly predicted. As the last step, the onboard ECU retrains the network and transmits the accuracy back to the app, where it is visible under "Test Accuracy". Once the performance is satisfying, the retrained neural network is applied to the machine.
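The way in which the labels entered in the app become a ground truth time series can be sketched as follows. This is only an illustration under stated assumptions: the message format (a time stamp in seconds plus a state name), the rule that a label stays valid until the next one arrives, and the function name `expand_labels` are hypothetical and do not describe the actual protocol of the app.

```python
def expand_labels(events, n_samples, sample_rate_hz=5.0):
    """Turn sparse label events into one label per measured sample.

    events: list of (time_in_seconds, state) tuples sorted by time
            (hypothetical message format).
    Returns a list with one state per sample; samples recorded before
    the first event are labeled None.
    """
    labels = []
    next_event = 0
    current_state = None
    for k in range(n_samples):
        t = k / sample_rate_hz
        # Apply every label that was entered up to the current sample time.
        while next_event < len(events) and events[next_event][0] <= t:
            current_state = events[next_event][1]
            next_event += 1
        labels.append(current_state)
    return labels


# The operator presses "traveling" at 0 s and "loading" at 0.5 s.
ground_truth = expand_labels([(0.0, "traveling"), (0.5, "loading")], n_samples=4)
```

With a sampling rate of 5 Hz, the four samples lie at 0 s, 0.2 s, 0.4 s, and 0.6 s, so only the last one falls after the second label.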

Figure 5.11: The sketch of the human-machine communication App system.

## **5.14 Measurement Setup**

To simulate the situations that OEMs are likely to meet, I deliberately change the control algorithm of the implement as well as the size of the shovel. In fact, in order to adapt to different tasks, OEMs modify the control program to facilitate the driver's operation. Also, the machines have different sizes for different working sites; among these differences, the most considerable distinction is the shovel size. Therefore, my measurement is set up based on these facts.

Fig. 5.12 shows the mobile machine which I used to gather the new data. Thanks to the dSpace, the control algorithm can be changed on this prototype mobile machine with ease.

Figure 5.12: The mobile machine used for the measurement data.

The newly gathered measurement data, comprising 24 Y cycles, are partly shown in Fig. 5.13. By observing the newly gathered data $D_t$, I find that the driver operated the joystick differently<sup>5</sup> compared to the driver who created the original dataset $D_s$. The dataset is normalized to accelerate the training process, so the influence of the varying shovel dimension might not be clearly visible.

In order to simulate the fact that different engineers may disagree on how to label the data because they follow different standards or rules, I deliberately label the newly gathered data differently from the previous study [63]. For the new dataset, I label a sample as traveling whenever dpbu is still fluctuating, which differs from the previous approach. Consequently, the distribution of the new dataset has also changed, so the marginal distributions of the source data and the target data differ considerably. To sum up, for the new measurement, I purposefully chose a different driver, a different control algorithm for the implement, a shovel of a different size, and a different engineer to label the dataset. Although this makes the task more challenging, I believe it is closer to reality and should be taken into consideration.

<sup>5</sup> The control strategy for the shovel is modified for other projects with other purposes. The system is more sensitive.

Figure 5.13: New measurement data with a different implement control algorithm and a different shovel dimension. The last subfigure shows the ground truth state of the measured data.

#### **5.14.1 The Sliding Windows Labeling Method**

After the raw data are gathered and labeled, I need to split the time series data into small sliding windows to train the neural networks. I sample the data at 5 Hz to avoid overloading the ECU. Obviously, the window size affects the system performance: the larger the window size, the more information is taken into consideration, and thus the higher the accuracy that can be expected. However, a larger window size may result in a delay between the moment the state occurs and the moment the machine detects it. In the following, I illustrate the mechanism of this delay.

#### **Labeling the Sliding Windows Based on the Whole Data**

I do not use the state of the last sample in the sliding window as the state of the sliding window, because the point in time at which the state changes is vague. Thus, I believe that I should not label a sliding window based on only one of its samples. Another drawback of using only one sample is that the resulting window labels can confuse the neural network, since most of the samples in the sliding window might indicate another state.

Here I set the sliding window length to 15, which means that a sliding window contains 15 labeled samples. In order to label these sliding windows, I calculate the distribution of the sample labels; for this reason, the sliding windows must have an odd length. If one state has the majority, I label the window with this state. For example, if the state changes from loading at the 7th sample to traveling at the 8th sample, the sliding window is labeled as traveling, since traveling holds the majority with 8 of the 15 samples. However, in this case, the state traveling occurs at the 8th sample, whereas the machine detects the sliding window as traveling only when the 15th sample is measured. Therefore, this method inherently introduces a delay. The method is illustrated in Fig. 5.14.
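The majority rule described above can be sketched in a few lines of Python. The function names `sliding_windows` and `label_window_majority` are hypothetical, and returning `None` when no state holds a strict majority is an assumption of this sketch, since the text only specifies the case in which one state has the majority.

```python
from collections import Counter


def sliding_windows(sample_labels, window_size=15):
    """Yield every window of `window_size` consecutive sample labels."""
    for end in range(window_size, len(sample_labels) + 1):
        yield sample_labels[end - window_size:end]


def label_window_majority(window):
    """Label a window with the state that holds more than half of its samples."""
    state, count = Counter(window).most_common(1)[0]
    return state if count > len(window) / 2 else None


# Example from the text: samples 1 to 7 are loading, samples 8 to 15 are traveling.
window = ["loading"] * 7 + ["traveling"] * 8
window_label = label_window_majority(window)  # "traveling"
```

With three possible states, an odd window length alone does not guarantee a strict majority, which is why the sketch returns `None` in that case instead of guessing a state.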

#### **Labeling the Sliding Windows Based on the Partial Data**

The previous labeling method is a reasonable way to label the sliding windows. However, the larger the window size, the longer the delay. In contrast, if I label the sliding windows based on only part of the samples in the window, the problem can partly be solved. Concretely, I use the last three or five samples in the sliding window to label it, as shown in Fig. 5.15. In this way, the delay can be reduced.
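The effect of voting only over the last samples can be sketched as follows. The function names are hypothetical, and the delay estimate assumes an odd number of voting samples, a single clean state change, and the sampling rate of 5 Hz mentioned above.

```python
from collections import Counter


def label_window_partial(window, n_last=3):
    """Label a window by majority vote over its last `n_last` samples."""
    tail = window[-n_last:]
    state, count = Counter(tail).most_common(1)[0]
    return state if count > n_last / 2 else None


def detection_delay_s(n_voting, sample_rate_hz=5.0):
    """Delay between a state change and the first window labeled with the new state.

    The new state needs n_voting // 2 + 1 of the voting samples, so the
    label flips n_voting // 2 samples after the change.
    """
    return (n_voting // 2) / sample_rate_hz


# Only the last two of 15 samples are traveling: the whole-window vote
# would still say loading, the vote over the last three says traveling.
window = ["loading"] * 13 + ["traveling"] * 2
partial_label = label_window_partial(window, n_last=3)

delay_whole = detection_delay_s(15)   # 7 samples at 5 Hz
delay_last3 = detection_delay_s(3)    # 1 sample at 5 Hz
```

Under these assumptions, the delay drops from 7 sample intervals (1.4 s) for the whole window of 15 samples to 1 or 2 sample intervals (0.2 s or 0.4 s) when only the last three or five samples vote.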

Figure 5.14: The diagram of the relabeling method.

Figure 5.15: The diagram of the relabeling method.

## **5.15 Comparison Between CRDNN and Other SOTA Time Series Classification Neural Networks**

Before I show the benefits of transfer learning, I first determine which neural network should be used as the base network. As mentioned in Section 5.9, LSTM-FCN is considered a SOTA solution for TSC tasks. In this section, I compare the CRDNN with LSTM-FCN with respect to micro F1, training time, and test time. Here the training time indicates whether the algorithm is suitable for immediate fine-tuning on the working site. The test time shows whether the algorithm is appropriate for real-time detection. My base networks were trained on an NVIDIA GeForce GTX 1050 GPU. To reduce the risk of stopping at a poor local minimum, I use early stopping with a patience of 100, which means that the training process is stopped 100 epochs after the best predictor was found. To further avoid overfitting, I adopt L2 regularization, the same as in my previous study. The optimizer I use is ADAM [180]. Also, I use ReLU as the activation function since it can be trained faster than the sigmoid. In Tab. 5.3, I show the performance of different neural networks with different window sizes (WS). Here I use the previous dataset to select the base network, so that the selected base network can be directly used in the next section, where the performance of transfer learning is discussed. If a model mispredicts the unloading process as the loading process or vice versa, a complicated operation strategy must be designed. Therefore, I only select models that never confuse the loading state with the unloading state. Among them, the CRDNN with 2 LSTMs and WS 15 has the shortest training time and test time. Its training time is 310.19 seconds, only about one third of that of LSTM-FCN with WS 15. Although its micro F1 is slightly worse than that of LSTM-FCN with WS 15, by less than 1%, I believe that a much shorter training time is conducive to more efficient transfer learning. Moreover, if I want to increase the micro F1, I can either increase the WS or use another variant of the CRDNN, the one with a bidirectional LSTM, to achieve almost the same micro F1, with a difference of less than 0.1%. Note that I do not pursue a further increase of the micro F1, since 98% is already the best meaningful performance: at this value, the deep learning model only makes mistakes at places where different engineers would define the ground truth state differently. Interestingly, although the micro F1 increases as the WS increases, the training time does not always increase with the WS. In short, based on the training results, the LSTM-FCN performs slightly better than the CRDNN with 2 LSTM layers and the CRDNN with one bidirectional LSTM layer and one LSTM layer; however, the training time of the LSTM-FCN would put enormous pressure on the ECU when transfer learning is performed there. Thus, I select the CRDNN with 2 LSTM layers as the base network for transfer learning. Also, I select a WS of 15 according to the training results.
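The early stopping rule with a patience value can be sketched independently of any deep learning framework. In the sketch below, a list of validation costs stands in for the real training loop, and the function name `early_stopping_epoch` is hypothetical; it is an illustration of the rule, not the training code used for Tab. 5.3.

```python
def early_stopping_epoch(validation_costs, patience=100):
    """Return (best_epoch, last_epoch) for a sequence of validation costs.

    Training stops once `patience` epochs have passed since the epoch
    with the lowest validation cost, or when the costs run out.
    """
    best_cost = float("inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, cost in enumerate(validation_costs):
        last_epoch = epoch
        if cost < best_cost:
            # A new best predictor was found; its weights would be kept.
            best_cost, best_epoch = cost, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch


# With a patience of 3, training stops 3 epochs after the best epoch.
best, last = early_stopping_epoch([3.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0], patience=3)
```

A larger patience gives the optimizer more time to escape a plateau at the price of additional epochs, which is the trade-off behind using a large patience for the offline base network.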




Table 5.3: Performance analysis. The performance of the five network structures with respect to total training time (s), micro F1 (%), average test duration (ms), and whether the network never mistakes unloading for loading or vice versa.

## **5.16 Transfer Learning Based CRDNNs**

Since I do not change the model architecture, there are two potential transfer learning methods: either I freeze the front parts of the CRDNN and further train only the fully connected layers to save training time, or I use the pre-trained model's weights as the initial parameters for further training of the whole model. Obviously, the first way is faster and reduces the computational effort on the ECU, yet the second way may achieve a better recognition performance. Generally speaking, I could use only the newly gathered data as the validation set, as other transfer learning tasks do. However, from the users' point of view, I evaluate the performance on both the previous data ($D_s$) and the new data ($D_t$). To evaluate the accuracy of each approach, I first show the micro F1 value and then illustrate the confusion matrices (CMs). Furthermore, to indicate whether a method is suitable for on-site transfer learning, I judge the approaches based on their training time (back-propagation) and test time (forward-propagation). Here I report the training time and test time on an Intel Core i7-4720HQ CPU @ 2.6 GHz, since these results are more appropriate as a benchmark for the onboard ECU. The hyper-parameters and the architecture are shown in Tab. 5.4, and the results are shown in Tab. 5.5, where ND, PD, FS, FTF, and OTF denote the newly gathered dataset, the previous dataset, training from scratch, fully connected layers transfer learning, and overall transfer learning, respectively. For transfer learning, I reduce the patience to 50 in order to achieve a relatively faster training process.
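Both evaluation quantities used here, the micro F1 and the check for confusions between loading and unloading, can be derived from a confusion matrix. The sketch below assumes that rows hold the ground truth and columns the prediction, and that the states are ordered as traveling, loading, unloading; the function names and the example matrix are hypothetical and not taken from the results.

```python
def micro_f1(cm):
    """Micro-averaged F1 of a square confusion matrix (rows: truth, columns: prediction)."""
    n = len(cm)
    true_positives = sum(cm[i][i] for i in range(n))
    off_diagonal = sum(cm[i][j] for i in range(n) for j in range(n) if i != j)
    # Every off-diagonal entry is a false negative for its true class
    # and a false positive for its predicted class.
    false_negatives = off_diagonal
    false_positives = off_diagonal
    return 2 * true_positives / (2 * true_positives + false_positives + false_negatives)


def confuses_loading_unloading(cm, loading=1, unloading=2):
    """True if loading was predicted as unloading or vice versa at least once."""
    return cm[loading][unloading] > 0 or cm[unloading][loading] > 0


# Made-up counts for the three states traveling, loading, unloading.
example_cm = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 0, 9],
]
score = micro_f1(example_cm)
rejected = confuses_loading_unloading(example_cm)
```

For a single-label classification task such as this one, the micro F1 equals the overall accuracy, here 26 correct predictions out of 30 samples.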


Table 5.4: Parameters of CRDNN with 2 LSTMs

### **5.16.1 Training from Scratch as Benchmark (ND+PD+FS)**

In order to give a basic overview of the difference between the CRDNN trained from scratch and the CRDNN trained by means of transfer learning, I show the training processes of both. The CRDNN trained from scratch is used as the benchmark to illustrate the benefits of transfer learning. When training from scratch, each epoch contains the data of 143 Y cycles, since I mixed the newly gathered and the previous dataset together, and the training process stops at about 75 epochs, as shown in Fig. 5.16(a).

(a) CM trained from scratch with previous and new data.

(b) CM trained from scratch with new data.

Figure 5.16: Cost versus epochs of CRDNNs from scratch and with transfer learning.

### **5.16.2 Further Training Only the Fully Connected Layers (ND+FTF)**

Because DNNs can be trained faster than CNNs, I first train only the final fully connected layers of the CRDNN and then analyze the performance. The model is further trained with the new dataset only. Here each epoch has only 24 Y cycles, and the training process stops at about 60 steps. After the transfer learning process, the prediction accuracy is much better than the results shown in Fig. 5.10. Concretely, the micro F1 increases to about 95%. The results are satisfying, but not as good as those of the totally retrained CRDNN. As shown in Fig. 5.16(c), the network in this configuration lacks the learning capacity to further improve the performance of the CRDNN, since the validation cost does not change while the training cost goes down.

Table 5.5: The performance comparison between different training methods

### **5.16.3 Train the Total Part of CRDNN (ND + OTF)**

Fig. 5.16(d) shows the result when I further train all parts of the CRDNN with the newly gathered data. The CRDNN has a stronger learning capability than the CRDNN with FTF, since the test cost decreases further as the number of epochs increases. The micro F1 of the CRDNN with OTF is higher since the state traveling occupies a majority of my dataset. So that the newly trained model also performs well on the previous data, I introduce the soft weight sharing method, which uses different learning rates for different layers of the neural network. Concretely, I set the learning rate for the CNNs and RNNs smaller than that for the DNNs.
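The idea of layer-dependent learning rates can be sketched with a plain gradient-descent step over named parameter groups. This is a toy illustration only: the group names, weights, gradients, and learning-rate values are hypothetical and are not the settings of the CRDNN in this work. Setting a group's learning rate to zero corresponds to freezing it, as in FTF, whereas small but non-zero rates for the earlier layers correspond to the OTF variant described above.

```python
def sgd_step(params, grads, lr_by_group):
    """One gradient-descent step with a separate learning rate per group.

    params and grads map a group name to a list of floats; lr_by_group
    maps the same names to that group's learning rate.
    """
    return {
        group: [w - lr_by_group[group] * g
                for w, g in zip(weights, grads[group])]
        for group, weights in params.items()
    }


# Hypothetical toy parameters and gradients for three parts of a network.
params = {"cnn": [0.5, -0.2], "rnn": [0.1], "dnn": [1.0, 2.0]}
grads = {"cnn": [1.0, 1.0], "rnn": [1.0], "dnn": [1.0, 1.0]}

# FTF-like setting: earlier layers frozen, only the dense layers move.
ftf_lr = {"cnn": 0.0, "rnn": 0.0, "dnn": 1e-2}
# OTF-like setting: all layers move, earlier layers more slowly.
otf_lr = {"cnn": 1e-4, "rnn": 1e-4, "dnn": 1e-2}

after_ftf = sgd_step(params, grads, ftf_lr)
after_otf = sgd_step(params, grads, otf_lr)
```

Keeping the rates of the convolutional and recurrent parts small means that the features learned on the previous data change only slowly, which is one way to retain performance on that data while adapting to the new one.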

(a) CM trained from scratch with previous and new data.

(b) CM trained from scratch with new data.

Figure 5.17: Confusion matrices of CRDNNs from scratch and with transfer learning.

### **5.16.4 Evaluating the Benefits of Transfer Learning**

The performance of these four methods is shown in Tab. 5.5. Since the first three methods are trained on the newly gathered data, the samples per epoch are much fewer than for the fourth method. Also, when I train only the fully connected layers, the number of trainable parameters is the smallest. Both factors reduce the training time. Thus, the training time for ND+FTF is reduced to about one tenth (10%) of that of ND+PD+FS and about one third (35%) of that of ND+FS. Here the training time is 62.90 s; however, since the model can be trained on different onboard ECUs or smartphones, the concrete numbers in the table are only meaningful as a benchmark for comparing one approach with the others. For instance, some ECUs on mobile machines may have a relatively low computational capability, resulting in training times 10 times longer than the values shown here. It is also possible that the onboard ECU is even faster, because mobile machines usually have a powerful energy source. Based on the comparison, I use ND+FTF as an emergency method that lets the model work with high accuracy on the new task immediately after new labeled data are fed into it. ND+FS shows high accuracy on the new data; however, since the newly gathered dataset is relatively small, the generalization capability of this approach is questionable. The other transfer learning method, overall transfer learning, has the best performance on the new data. It also performs well on the previous data, which indicates a good generalization capability. Moreover, its training time is only about one third (33%) of that of ND+PD+FS. Therefore, I recommend using ND+OTF to train the network when there is no great time pressure or the mobile machine has a relatively powerful ECU. Note that the micro F1 of ND+PD+FS is slightly worse than the results in Tab. 5.3 because the patience is lower. As shown in the CMs in Fig. 5.17, the models do not mistake the loading process for the unloading process, which means that all the models ease the design of the operation strategy; thus, all of them have the potential to be used, should OEMs have special requirements.

To illustrate how offline learning with online adaption saves time compared to pure online learning, I show the training process of transfer learning. As shown in Fig. 5.18, the training and validation costs explode at epoch 63. The training of the base network on the computer is finished at step 63, since the validation cost begins to grow. Hence, at this point, I add the new labeled dataset to simulate the real scenario for transfer learning. Right after the new data are introduced, both the training and validation costs jump to an extremely high level, since the dataset has an enormous variance. Consequently, the prediction results must be unsatisfying. Interestingly, after only a few steps of further training on the onboard ECU, the cost drops dramatically to a low level. As a result, the CRDNN is again suitable for predicting the truck loading process, even when the scenario is quite different from the original dataset.

Figure 5.18: The mechanism of transfer learning. The blue line is the training cost on (Ds), the red line is the validation cost (Ts), the purple line is the training cost on (Dt), and the cyan line is the validation cost (Tt).

Based on the results of this section, I find that transfer learning is a powerful tool for making the CRDNN robust in the challenging task of detecting Y cycles. The transfer-learning-based CRDNN with 2 LSTMs is the most appropriate model, since it can be retrained much faster than the LSTM-FCN at the cost of only 1% accuracy. Without transfer learning, the model cannot guarantee excellent performance on the new target task (Tt); thus, I recommend using the transfer-learning-based CRDNN for the detection of Y cycles.

## **5.17 The Advantages of This System from Engineers' View**

Here I sum up the main advantages of the transfer-learning-based CRDNN and the corresponding IoT system as strong, fast, and easy.

### **5.17.1 Strong**

This system aims to improve the efficiency of the novel torque-controlled hydrostatic mobile machines by correctly detecting the working process. It automatically recognizes the working state without an additional button or human action, which offers essential information for the energy regeneration process. Thanks to transfer learning, the system can be adapted to a new machine without a complicated calibration process, even when its data follow a different distribution from the source dataset. The test accuracy of this working state recognition system can reach 98% on the challenging dataset [63], which achieves human-level performance and guarantees accurate recognition. The strong generalization ability of the transfer-learning-based CRDNN is thus proven.

### **5.17.2 Fast**

Usually, an excellent generalization ability comes at the cost of speed. However, the transfer-learning-based CRDNN is fast. It is an offline method with online adaption; thus, it is a realtime algorithm. Also, transfer learning needs much less computational effort, which makes on-site training of the CRDNN possible.

### **5.17.3 Easy**

Generally speaking, an interface that controls an extensive system is complicated. However, the UI of the IoT system designed in this chapter is easy to use. The operators only need to give the data the appropriate label and check the model accuracy. The system automatically does most of the training steps.

At the end of this section, I compare the performance of the different approaches in Tab. 5.6, where the online learning approach was evaluated with a batch size of 1.


## **5.18 Conclusion**

In this chapter, I have shown that the CRDNN with bidirectional LSTMs has the best performance in detecting the truck loading cycles, and that the CRDNN with 2 LSTMs has the best performance-cost ratio if the primary torque control concept is used. Because I use an offline learning strategy and forward propagation is much faster than backward propagation, this method does not take up too much computational effort. By considering a period of 5 seconds, the test accuracy reaches 98.2%, and the model never mistakes the loading process for the unloading process or vice versa, which makes the operation strategies easy to implement. Also, since I have a large dataset, a small number of mislabeled samples does not harm the real performance of the CRDNNs. It is also worth pointing out that, although the CRDNN has only increased the test accuracy by 3%, it gains the most challenging 3% by successfully detecting the state in data gathered when the drivers did not operate well.

Afterward, I upgrade the naive CRDNN to the transfer-learning-based CRDNN. Thanks to transfer learning, the generalization ability of the CRDNN is much enhanced, so that it becomes a powerful solution to the high variance problem in detecting the truck loading process. Since transfer learning needs new data, I completed the IoT system of mobile machines by building a human-machine communication system on the smartphone for the purpose of gathering the data quickly. The model I recommend can be trained very fast, so that the workers can adapt the model directly on the working site after gathering the new data rather than sending the data to a deep learning specialist. As the results show, the proposed methods can always help the pre-trained CRDNN achieve satisfactory performance with respect to precision and recall. Besides, the training time on the onboard ECU can be reduced by about 70% to 90% compared to retraining the neural network from scratch on the onboard ECU. Also, I use the new method to label the sliding window, so that I can partly solve the delay of the prediction results in the previous version of the CRDNN.

Building on the successful detection of the truck loading process, I envision motion prediction methods based on the working process.

# **6 Visual Monitoring of Working Site<sup>1</sup>**

Although the AI MAPF algorithm guides the machines to move efficiently, there is a potential risk. Since the machines' locations are gathered with a GPS/IMU system and then sent to the AI system, participants without localization equipment are ignored by the system. Consequently, tragedies might happen. Hence, a visual monitoring system is introduced in this chapter to monitor working sites. Current computer vision algorithms have shown excellent performance in detecting many common objects. On well-known datasets, the best algorithm up to 2020 achieved about 0.6 mean Average Precision (mAP), thanks to the remarkable innovation and effort of scientists. However, for commercial systems that concern personal safety, this value is not satisfying. Considering that all machines and workers operate on a closed site, I attempt to increase the detection performance through reasonable overfitting. In light of that, I created the MOMA dataset, which includes eight classes of commonly used mobile machines and can serve as a base dataset to be extended with data collected onsite and then used to train SOTA algorithms to detect mobile construction machines. The images are taken from outside the mobile machines, since I believe fixed cameras on the ground are more suitable if all the machines of interest work on a closed site. Most of the images in the MOMA dataset show real scenes, whereas some are from the official websites of top construction machine companies. Also, I have evaluated the performance of YOLOv3 [181] on the selected scenario, indicating that SOTA computer vision algorithms already show excellent performance in detecting the mobile machines on a specific working site. The visual monitoring system compensates for the system's inability to recognize participants without a location system and works as a safety system.

<sup>1</sup> Except for some minor modifications, all the figures, text, and results of the work presented in this chapter have been published in my preprint publications [62]. My contribution to the papers is summarized as 100% in terms of conception and methodology, 90% of literature review, 40% of realization, 30% of data collection and labelling, 50% of results visualization, and 95% of formulation.

## **6.1 Introduction**

Research on fully and semi-automated mobile machines has prospered in the past decades. Mostly, the introduction of novel technologies aims to increase productivity, enhance the safety of the workers, and reduce the cost of operation. Among these new contributions, computer vision has attracted the most attention. Thanks to the boom of deep learning, the recognition capability of artificial intelligence outperforms human-level recognition for many tasks.

Mobile machines usually work in a closed campus, which makes the autonomous driving of mobile machines a level four task according to the SAE standard [182]. Currently, there are many significant deep learning methods to visually detect objects of interest, such as YOLOv3 [181] and Faster-RCNN [183], which achieve an appealing trade-off between speed and accuracy.

Without a doubt, a series of researchers in the field of construction machines have explored the possibility of using computer vision technologies and deep learning to recognize mobile machines. Unfortunately, until today (July 2020), no well-known, easily accessible, and directly downloadable database containing common types of mobile machines, such as excavators, wheel loaders, bulldozers, and dumpers, has been published. As we know, the success of deep learning mainly benefits from three aspects: the generation of large-scale datasets, the development of robust models, and a large amount of computing resources. The absence of such a dataset limits the development of autonomous driving or working of mobile machines.

To address the paucity of well-annotated images of mobile machines in current public datasets, a specific dataset for mobile construction machines is created: the MOMA dataset. Images with varying viewpoints, poses, partial occlusions, and depths of field were collected. Eight common categories across 5,663 images were organized in the standard PASCAL VOC format, and 19,977 object instances were labeled. Based on this challenging dataset, it becomes possible to achieve good detection within the closed campus by adding only a small amount of data newly gathered onsite. I anticipate spurring mobile machine detection to a higher level with a well-prepared base dataset. Fig. 6.1 illustrates samples from my dataset.

The highlights of this chapter can be summarized as follows:


Figure 6.1: Sample images in dataset MOMA, including 7 classes of construction machines as well as person with varying poses on the working scenarios. Images in column (a): objects in iconic view; column (b) objects under partial occlusion; column (c) objects in varying poses; column (d) objects in non-iconic perspective.

task and try to develop more suitable algorithms, making an appropriate dataset can be an alternative to achieve the high-level detection task.

• As the tasks in the field of construction machines are level four, adding some custom figures of machines that need to be tested into my dataset can surely increase the predictor's performance. Thus, I also developed the program to analyze the modified dataset.

The rest of the chapter is organized as follows: I first briefly introduce the previous studies on computer-vision-based algorithms and datasets for common objects and construction machines. I then present the MOMA dataset in detail in the following section. Next, I analyze the performance of YOLOv3 on the MOMA dataset and give the currently feasible solution for the construction industry, i.e., show how to leverage the dataset to detect mobile machines in practice. Finally, Section 6.6 gives the conclusions of this chapter.

## **6.2 Related Works**

### **6.2.1 The Well-Known Datasets**

Data are the cornerstone of deep learning because deep learning models obtain their knowledge directly from the data. Although the importance of datasets was not so evident before 2012 [184], the deep-learning community now agrees that data have been the vital driving force behind computer vision technology [185, 186]. To circumvent the bottleneck of limited data, both Acuna and Yu proposed methods to accelerate the human labeling process, by software [187] or by a partially automated labeling scheme [188]. Besides these, Northcutt proposed an approach named Confident Learning (CL) to evaluate the quality of the data [189]. With the rapid development of computer vision, image recognition datasets are also growing at a rapid pace. For classification, Caltech 256 is famous, with more than 100 categories [190]. Many scientists have also published classification datasets based on videos, such as [191]. Besides that, more commonly applied general-purpose datasets have been created, including the PASCAL VOC dataset [192], Microsoft COCO [193], and ImageNet [194]. There are also many specific datasets for specific tasks, including pedestrians [195], scene parsing [196, 197], human activity [198, 199], and face recognition [200]. In autonomous driving, KITTI is considered the pioneer; it contains objects of interest in realistic scenarios in the city of Karlsruhe [201]. Subsequently, Cityscapes [202] and RobotCar [203] have also contributed diverse datasets to the autonomous driving community. Since the aforementioned automated driving datasets were collected in European countries, scientists in other parts of the world have published their own datasets of much larger size, namely BDD 100K [185] in the USA, ApolloScape [204] in China, and nuScenes [205] in both the USA and Singapore.

From this comprehensive literature review, it can be concluded that a benchmark dataset should be diverse, abundant, consistent with the actual scene, and released online.

### **6.2.2 Recent Object Detection Algorithms**

Computer vision is a grand and long-standing subject. Before 2012, the most outstanding algorithms were based on hand-crafted features, such as the Histogram of Oriented Gradients (HOG) [206] and the Scale-Invariant Feature Transform (SIFT) [207]. In this period, a famous algorithm based on convolutional neural networks was LeNet [208]. However, due to the limited computing power at the time, it was quite shallow and had few trainable parameters. Therefore, the advantages of deep learning were not significant at that time. After AlexNet [209] won the ImageNet challenge [194], the so-called large scale recognition challenge, in 2012, deep convolutional neural networks attracted much research attention. In 2014, VGG-16 [210] was proposed, and it is used as the base network for many applications. After that, the inception network [211], which combines most of the deep learning ideas, was designed; a particular form of it is called GoogLeNet. Moreover, He showed that neural networks can even surpass human-level recognition [212], and in the same year he invented ResNet [213], which introduces the concept of skip connections and makes the training of much deeper neural networks possible, because the identity function is easy for a residual block to learn. Usually, deep neural networks have a large number of trainable parameters and thus need plenty of time to be trained. To address this problem, transfer learning has gained attention. Through transfer learning, the training time on specific tasks can be dramatically reduced compared to training the whole model from scratch. Therefore, instead of directly training the whole model, most researchers download models pre-trained on ImageNet. At the time of writing this chapter, the most well-known and successful computer vision algorithms for object detection are YOLO, RCNN, SSD, and their variations. 
Redmon developed YOLO [214] into YOLOv2 [215] and then into YOLOv3 [181], whereas RCNN [216] was enhanced to fast R-CNN [217] and faster R-CNN [183]. Comparisons among these algorithms can be found in many scientific papers, such as [218]; thus, I only give a brief summary here. Since YOLOv3 is a one-stage method and treats the task as a regression problem, it is quicker and well known for its realtime capability. In contrast, faster R-CNN adopts a region proposal network and achieves slightly higher accuracy in most competitions and tasks. In this chapter, I adopt YOLOv3 for my system since it reduces the burden on the hardware.

### **6.2.3 The Previous Contributions on Detecting Mobile Machines**

To date, besides general-purpose tasks, computer vision is used in many specific applications, such as airplane detection [219], ship detection [220], and, of course, the detection of mobile construction machines.

The idea of using a camera to recognize mobile machines visually is not novel. To the best of my knowledge, the first research can be traced back to 1990, when Eldin wanted to use a camera to increase the productivity of the construction of a state prison in the USA. Before the rise of very deep neural networks, a series of researchers had already reached some achievements in this field. Azar developed a model for the detection and pose estimation of non-rigid excavator equipment in construction images and videos [221]. In 2011, Chi used a background subtraction algorithm to extract motion pixels, which are then grouped into regions; after that, the regions are identified using classifiers [222]. The dataset, comprising 750 images, is equally divided into three classes: skid steer loader, backhoe, and worker. It achieved an overall classification error of 3.9% with neural networks. The research also pointed out that the similarities between loader and backhoe may cause worse performance. Both Park and Memarzadeh presented methods that can be summarized as a combination of HOG and the HSV color histogram to localize construction workers or equipment in video frames [223, 224]. In 2014, Tajeen mentioned that they had built an image dataset for construction equipment recognition, including 300 images [225]. After the success of the Convolutional Neural Network (CNN), the application of CNN-based object detection to mobile machines and construction sites has been pursued over the past decade. A consensus has been reached in the mobile construction machines industry: for a variety of image recognition tasks, well-designed deep neural networks have far surpassed previous methods based on hand-designed image features. Fang uses the Improved Faster Regions with Convolutional Neural Network Features (IFaster R-CNN) approach to detect excavators and workers in realtime on their own dataset [226]. 
Kim researched both scene parsing [227] and object detection [228] for construction machines. In their subsequent research, the estimated context information was used to reduce the cost of the earthmoving process [229]. In 2019, Son used a very deep neural network to detect workers on the working site, which was claimed to yield an accuracy of 91% and 95%, exceeding the SOTA descriptor in image target detection methods at that time. In this paper, Son emphasizes the importance of varying poses and changing backgrounds [230]. Son also points out that the visibility of the equipment operator is inherently poor [231], which is consistent with my point of view. Recently, Bang proposed an image augmentation method to enhance the performance of object detectors on construction sites, achieving a recall of 66.76% and a precision of 53.08% experimentally on UAV-based resources [114].

Based on the literature review, I find that the research of Kim [228], who also aims to detect mobile machines, is the most similar to mine. Besides mobile machine detection, Kim claims to have built a dataset based on images from ImageNet. Since R-FCN is powerful and the dataset is relatively large, i.e., 2,920 samples, it is no wonder that they achieve excellent performance. However, the dataset cannot be used as a base dataset for one reason: the background of the samples gathered from ImageNet is mostly not a real working site. Consequently, the winning algorithms on this dataset may not perform excellently in practice due to the dramatic domain shift.

## **6.3 Why I Created the MOMA Dataset?**

As I mentioned, for a level 4 task, the generalization capability of the detection model for objects outside of the specific area can be ignored. Reducing the difference between the training data and the test data improves the performance of the AI model. Also, the training dataset should not be so small that necessary information is lost. Hence, I created MOMA as a base dataset to be mixed with newly gathered data taken directly from the working site that will be monitored. In this fashion, the data from MOMA give more general information, while the newly gathered data provide more test-data-related information.

## **6.4 The MOMA Dataset**

In this section, I first summarize my steps to build the dataset and then describe the details in the subsections. The MOMA dataset is created as a specific dataset for commonly used mobile machines, and it is challenging and diverse. One thing is worth mentioning: I believe that the cameras on a construction site are more likely to be installed at fixed positions on the ground than on the moving construction machines. In most cases, construction machinery works within a limited range, so mounting the cameras on the vehicle is no longer the only option. In addition, the advantages of fixing the cameras on the ground are obvious. First and foremost, cameras installed on the ground can provide depth information from the images, given an appropriate calibration of the cameras. The calibration process is relatively easy here, since the relative coordinates among the cameras are constant and free of vibration. Also, a wider angle of view and a cleaner lens can be achieved, whereas the machines are usually surrounded by dust during work, which limits the vision. Thus, I prefer to select images gathered from a perspective outside of the mobile machines, which is quite different from the training images of self-driving cars. With this setup, wireless communication has to be developed for information sharing between machines and cameras; the corresponding research can be found in [64, 40]. Consequently, eight common categories across 5,663 images were organized in the form of the PASCAL VOC dataset, and 19,977 object instances were labelled for the research in this chapter, see Fig. 6.2.

Based on a survey of the most vital participants on the working site, I clearly defined the categories to focus on as the first step before collecting data. Unlike other categories, mobile machines vary to a certain extent depending on their components and working conditions. On top of that, human beings must be included in the dataset, since most researchers believe that accurate detection of humans on the working site can improve safety. Therefore, I limited the detection classes to common representative groups: excavator, truck, dumper, bulldozer, wheel loader, car, compactor roller, and person.

To ensure that the algorithms trained on my dataset perform as well as possible in practice, I collected the candidate images both from video frames and from the official websites of construction machine companies. The streaming video files were collected under different real scenarios, which makes my dataset closer to the actual situation on the working site, and I then cut them into images. Besides the images from the videos, I also gathered some images directly from the websites of famous construction machine companies, such as Caterpillar and Komatsu, with the help of chromedriver and a web crawler, since I believe that introducing these images can enhance the performance of the predictors. The images from the videos are in a non-iconic view, like the images in MS COCO. In contrast, the images from the official websites of construction machine companies show a canonical perspective, like the samples in Caltech. Both make significant contributions to ensuring a relatively high recall. Finally, for the visual perception task, more than 25,000 images were gathered, from which 5,663 representatives were selected.

Figure 6.2: The statistics of the MOMA dataset.

Next, I annotated the ground truths of the defined classes in the selected images. I use the annotation tool "labelImg" from Lin [232], which is mainly intended for object detection labeling. The software can generate both XML files for Faster-RCNN and text files for YOLOv3. Since the XML file contains more information than the txt file, I save the dataset in XML and then convert the XML into txt. As we know, careful labeling helps a dataset stand out in training, evaluation, and detection performance, whereas missing labels, false annotations, a widely unbalanced instance distribution, and too much clutter impair the effectiveness and robustness of a dataset. Therefore, before the dataset is fed into the training models, it is worth analyzing it statistically and subsequently splitting it into subsets for training the predictor and for cross-validation. A significant imbalance among the categories of interest would probably weaken the performance. In this case, countermeasures such as examining label deficiencies and then moderately supplementing data must be taken to keep the predictor robust across all classes. After careful preparation, the MOMA dataset basically does not have such a problem; however, considering that I need to add onsite gathered data to my dataset, I have created a tool to evaluate the balance of classes in the dataset.
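The two steps described above, converting the XML annotations into YOLO-style text lines and checking the class balance, can be sketched with the Python standard library. This is a minimal sketch and not the tool of this work: the label strings in `CLASSES` and their order are assumptions based on the eight categories listed earlier, and the XML snippet is an invented PASCAL VOC style example. The YOLO text format stores one object per line as class index followed by the box centre, width, and height, each normalized by the image size.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed label strings and ordering (hypothetical; check the real files).
CLASSES = ["excavator", "truck", "dumper", "bulldozer",
           "wheel loader", "car", "compactor roller", "person"]


def voc_to_yolo(xml_text, classes=CLASSES):
    """Convert one VOC annotation to YOLO lines and count the classes."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines, counts = [], Counter()
    for obj in root.iter("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text)
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Corner format -> normalized centre format.
        xc = (xmin + xmax) / 2 / img_w
        yc = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(
            f"{classes.index(name)} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
        )
        counts[name] += 1
    return lines, counts


# Invented annotation with two objects in a 1000 x 500 image.
SAMPLE = """
<annotation>
  <size><width>1000</width><height>500</height><depth>3</depth></size>
  <object><name>excavator</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>250</ymax></bndbox>
  </object>
  <object><name>person</name>
    <bndbox><xmin>800</xmin><ymin>100</ymin><xmax>900</xmax><ymax>400</ymax></bndbox>
  </object>
</annotation>
"""
yolo_lines, class_counts = voc_to_yolo(SAMPLE)
```

Summing the returned counters over all annotation files gives the number of instances per class, which is the quantity needed to spot an imbalance after onsite data are added.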

Besides the balance among different classes, it is necessary to have the right balance between the training and test sets to obtain a stable estimate of the predictor's performance. With less training data, the trained model tends to have a bias problem. In contrast, less testing data leads to a higher variance in the performance statistics. I randomly split the dataset into trainval, i.e., training and validation, and testing with a ratio of 4:1.
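A random 4:1 split of this kind can be sketched as follows. The function name, the fixed seed, and the generated image identifiers are hypothetical; only the ratio and the dataset size of 5,663 images come from the text.

```python
import random


def split_trainval_test(image_ids, test_fraction=0.2, seed=0):
    """Randomly split image ids into (trainval, test) lists.

    A fixed seed makes the split reproducible; sorting first makes the
    result independent of the order in which the ids were listed.
    """
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical identifiers standing in for the 5,663 image file names.
all_ids = [f"{i:06d}" for i in range(5663)]
trainval_ids, test_ids = split_trainval_test(all_ids)
```

The two lists could then be written to the corresponding text files of the dataset, one image identifier per line.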

Finally, the richly annotated dataset is tested with a SOTA object detection algorithm, concretely, YOLOv3. In the meantime, whenever I find that the detector does not work well in a specific situation, I add more labeled images of that situation to my dataset. In this fashion, I increase the diversity and scene variation of the dataset. In addition, the metric mAP is used to evaluate the detection performance. Here I use the recommended parameters and thus set the threshold of Intersection over Union (IoU) to 0.5. An Average Precision (AP) comparison with the best parameter settings is conducted across all selected categories.
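For reference, the IoU criterion can be written in a few lines. This is a generic sketch that assumes boxes in corner format (xmin, ymin, xmax, ymax); the helper `is_true_positive` is a hypothetical name that applies the 0.5 threshold to a single prediction and ground-truth pair and ignores class labels and duplicate matches.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct if its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


half_overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))   # 50 / 150
accepted = is_true_positive((0, 0, 10, 10), (2, 0, 12, 10))  # 80 / 120
```

A box shifted by half its width therefore falls below the threshold, while a shift of one fifth of the width is still accepted.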

### **6.4.1 Data Acquisition**

Thousands of images can be acquired easily, as I have open access to search engines and social media, e.g., Google and Flickr. Web images can be found and downloaded by crawling through websites. Hence, a crawler based on the Scrapy framework was built to grab pictures from the Google search engine and from the websites of mechanical engineering machinery. Special Python scripts were created for each provider based on the HTML structure of its site. By executing the Python file created by Wang [233], images of interest from most pages of a website can be collected.

Nevertheless, most images from search engines present a canonical view of the objects, which could bias the algorithm towards assuming that mobile machines are always located at the center of the view. This may lead to a deviation from the predictors' optimal performance if they are trained only with these images. Despite this weakness in real inference, the web-based images from various providers show diversity in size, luminance, resolution, color, and background, as well as ambiguity, and thus help models gain an understanding of essential object features. Moreover, most construction machine providers publish their new models promptly on their websites; thus, adding these images can enhance the predictors' recognition capability. Since these images are quite easy to detect and may thus exaggerate the performance of the detectors, they are not included in the main dataset of MOMA. However, they are well prepared and saved in an additional file in my dataset for training.

By showing mobile machines from multiple angles and in their real-time working status, videos strengthen the generalization of predictors in realistic working surroundings. By extracting one image every 50 frames from the videos, I build up the non-iconic part of the dataset. For this part, I deliberately select videos in which the working poses of the machinery vary, as well as the apparent machine size due to the depth of the perspective. Since the images are collected from realistic scenarios, occlusion and truncation are inevitable. In this fashion, thousands of images can be produced, making the detector applicable to various practical scenarios and, of course, to real-time detection. In total, 20,895 images were captured from 125 videos, of which 5,663 were selected for training and validating the models.

### **6.4.2 Dataset Format**

I use PASCAL VOC format as the exemplar dataset format for my task. Fig. 6.3 illustrates the structure of the MOMA dataset.

Figure 6.3: Hierarchical structure of the MOMA dataset, based on PASCAL VOC.

Similar to PASCAL VOC, the directories "Annotations", "labels", "JPEGImages", and the subdirectory "Main" under "ImageSets" are the essential components, each containing the relevant files. To run Faster-RCNN on the MOMA, the XML files, the jpg files, and the files train.txt and test.txt in "Main" are required, whereas the YOLOv3 detector is trained with the label text files, the jpg files, and train.txt and test.txt in the folder "Main".

Figure 6.4: Labelling tool. The label tool "labelImg" can load multiple images from a directory by clicking "Open Dir" in the menu and save the annotations under a pre-defined path. The bounding boxes shown, which closely surround the ground-truth objects (the target instance in this figure is an excavator), were made bold for salience. Multi-class and multi-label annotation of a single instance is possible; all marked labels are listed at the top right corner. As I consider the PASCAL VOC format the standard format for the MOMA dataset, all annotation files are saved in XML format.

Labeling is non-trivial work. To avoid drawing and annotating the rectangular boxes twice, the dataset was initially built only in XML format. SOTA object detection algorithms such as Faster-RCNN, SSD, and YOLOv3 require basically the same annotation information about the targets of interest in two-dimensional images, namely their coordinates and categories, which are generally expressed in the form (left, top, width, height, class). Although the ground-truth targets were labeled only in XML format to save labor, the text annotations can be generated from them by a program. During the transformation, the location of every object is translated from (xmin, ymin, xmax, ymax), as used in the XML files for Faster-RCNN, to (xcenter, ycenter, w, h), as expected in the text label files for YOLOv3. Besides, all coordinates, widths, and heights in the text annotations are normalized to the range from 0 to 1. Therefore, attention should be paid whenever the parameters x, y, w, and h are calculated. For instance, they must be multiplied by a constant image size in the optimization process by k-means clustering of the annotated anchors, because the anchor centroids are measured in pixels.
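
The coordinate transformation described above can be sketched as follows. This is an illustrative Python snippet, not the conversion program used in this work; the function names are hypothetical, and the network input size of 416 pixels in the usage note is an assumption.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a VOC box in pixels to a normalized (xcenter, ycenter, w, h) tuple."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def normalized_size_to_pixels(width, height, input_size):
    """Scale a normalized (w, h) pair back to pixels for a square input of input_size."""
    return width * input_size, height * input_size


# Example: a box (561, 52, 1001, 382) in a 1280 x 720 image
xc, yc, w, h = voc_to_yolo(561, 52, 1001, 382, 1280, 720)
```

For the example box, the normalized width is 440/1280 = 0.34375; multiplying it by an assumed input size of 416 gives 143 pixels, which is the unit in which anchor centroids are expressed.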

### **6.4.3 Manual Annotation**

Labeling is exhausting and costly, but it is a prerequisite for object detection; all the aforementioned annotation files, such as the XML files in the directory "Annotations", were labeled manually. I used the label tool "labelImg", a well-known graphical image annotation tool available in the GitHub repository of Lin [232], for the labeling job. I annotated every single object in an image with a bounding box that encloses the ground truth of the object and marks the class it belongs to. Fig. 6.4 illustrates the graphical interface of the label tool "labelImg" and an annotation sample.

The XML file saved for the image annotated in Fig. 6.4 is shown in the following listing. It comprises all the ground truth information that I need to train the neural network with the samples.

```
<annotation>
  <folder>images</folder>
  <filename>sample.jpg</filename>
  <path>\path\to\the\sample.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>1280</width>
    <height>720</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>excavator</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>561</xmin>
      <ymin>52</ymin>
      <xmax>1001</xmax>
      <ymax>382</ymax>
    </bndbox>
  </object>
  <object>
    <name>truck</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>394</xmin>
      <ymin>344</ymin>
      <xmax>704</xmax>
      <ymax>537</ymax>
    </bndbox>
  </object>
  <object>
    <name>bulldozer</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>694</xmin>
      <ymin>395</ymin>
      <xmax>950</xmax>
      <ymax>612</ymax>
    </bndbox>
  </object>
</annotation>
```
Here I summarize the most decisive info tags in XML that should be kept when transforming the format into txt files.
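
As an illustration of how such tags can be read before writing the txt files, the following Python sketch parses an annotation with the standard library. Which tags are kept here (filename, image size, class name, difficult flag, bounding box) is an assumption made for this example, and the function name `read_annotation` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened annotation with one object, in the PASCAL VOC layout shown above
SAMPLE_XML = """<annotation>
  <filename>sample.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>excavator</name>
    <difficult>0</difficult>
    <bndbox><xmin>561</xmin><ymin>52</ymin><xmax>1001</xmax><ymax>382</ymax></bndbox>
  </object>
</annotation>"""


def read_annotation(xml_text):
    """Extract the file name, image size, and all labeled objects from a VOC XML string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    record = {
        "filename": root.findtext("filename").strip(),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": [],
    }
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        record["objects"].append({
            "name": obj.findtext("name").strip(),
            # A missing <difficult> tag is treated as 0 (assumption)
            "difficult": int(obj.findtext("difficult", default="0")),
            "box": tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return record


record = read_annotation(SAMPLE_XML)
```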


To reduce interference from ubiquitous noise, every object of interest in an image, including occluded and truncated instances, was labeled with care whenever the human eye could spot it. Occluded and truncated objects are thus also counted as ground truths. Here I follow the idea of Yu [185] that occluded and truncated objects should be marked explicitly. Concretely, I annotated a truncated excavator as "excavator, t" and an occluded excavator as "excavator, o", since labelImg does not offer this option. The purpose of this marking is to support the development of more robust algorithms. For the implementation of YOLO or Faster RCNN, I wrote a program that strips these suffixes.
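
Stripping the suffixes could look like the following sketch. It is not the program used in this work; the function name `split_label` is hypothetical, and the handling of unexpected suffixes is an assumption.

```python
def split_label(raw_label):
    """Split a label such as 'excavator, t' into the class name and its flag.

    Returns (name, flag), where flag is 't' (truncated), 'o' (occluded), or None.
    """
    name, _, flag = raw_label.partition(",")
    flag = flag.strip().lower()
    # Anything other than 't' or 'o' is ignored (assumption)
    return name.strip(), (flag if flag in ("t", "o") else None)


# Class names without suffixes, as needed for YOLO or Faster RCNN
raw_labels = ["excavator, t", "excavator, o", "truck"]
clean_names = [split_label(label)[0] for label in raw_labels]
```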

To ensure the quality of my dataset, consistent rules were made for the labeling process as in the following items:


Figure 6.5: Label example. Two excavators and two trucks should be labeled in this image. They can all be distinguished by eyesight even though they are partially blocked.

Figure 6.6: Label example. Four objects can be clearly seen in the image, even though the excavator in the distance is much smaller than the one nearby. Moreover, the two standing workers can be recognized as well.

### **6.4.4 Dataset Splits**

As mentioned above, the dataset is meant for training as well as testing. Therefore, I divided it randomly into four subsets so that the training set and the test set follow the same data distribution. To achieve a stable estimate of model performance, a reasonable balance between the training and test sets is required. The partitioning ratio can be chosen flexibly depending on the volume of the database; in practice, the larger the dataset, the smaller the proportion that needs to be reserved for testing. Based on the amount of data, I split the dataset into approximately 80% for training and validation and 20% for testing.

The arrangement of the data division is illustrated in Fig. 6.3. A txt file such as trainval.txt lists image file names including their suffix. Literally, all images whose names appear in trainval.txt are intended for training and validation. Likewise, I test the predictors on the images whose names appear in test.txt. The data in trainval can be further split into training and validation subsets. Together, these four groups make full use of the complete database in order to obtain a satisfying prediction performance.
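
A random split of this kind can be sketched as below. The 20% test fraction follows the text; the validation fraction, the fixed seed, and the function name are assumptions made for illustration.

```python
import random


def split_dataset(image_names, test_fraction=0.2, val_fraction=0.1, seed=0):
    """Randomly split image names into train, val, trainval, and test lists."""
    names = sorted(image_names)
    # A seeded generator keeps the split reproducible
    random.Random(seed).shuffle(names)
    n_test = round(len(names) * test_fraction)
    test = names[:n_test]
    trainval = names[n_test:]
    # val_fraction is taken from trainval, not from the whole dataset (assumption)
    n_val = round(len(trainval) * val_fraction)
    return {
        "train": trainval[n_val:],
        "val": trainval[:n_val],
        "trainval": trainval,
        "test": test,
    }


splits = split_dataset([f"img_{i:04d}.jpg" for i in range(100)])
```

Each of the four lists can then be written line by line into the corresponding txt file under "ImageSets/Main".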

### **6.4.5 Data Preprocess**

It is widely accepted that clean data helps improve detection performance. Prior to the implementation of the CNNs, the dataset is analyzed statistically to remove ineffective labels and to ensure its conformity with the working scenarios. This is an essential step since the dataset will be modified to better suit other tasks.

To list the labels annotated with the tool "labelImg", a dedicated Python script was written. Executing the script prints a list of class/count pairs, e.g., (excavator 536). If misspelled labels such as "excavater" are found, the corrected class name must be written back into the XML files. A program that automatically finds such errors was built [233].
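
The counting and correction step can be illustrated with the following sketch. It is not the script from [233]; the set of known classes and the correction table are hypothetical examples.

```python
from collections import Counter

# Hypothetical class list; the real dataset may define further classes
KNOWN_CLASSES = {"excavator", "truck", "bulldozer", "car", "person", "wheel loader", "dumper"}


def find_unknown_labels(labels):
    """Return the labels that are not in the known class list, with their counts."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if name not in KNOWN_CLASSES}


def apply_corrections(labels, corrections):
    """Replace misspelled labels according to a mapping such as {'excavater': 'excavator'}."""
    return [corrections.get(label, label) for label in labels]


labels = ["excavator", "excavater", "truck", "excavator"]
unknown = find_unknown_labels(labels)
fixed = apply_corrections(labels, {"excavater": "excavator"})
class_counts = Counter(fixed)
```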

Figure 6.7: Label example. The rectangles correctly enclose the ground truths: a dumper, an excavator, and two persons. Other trucks can be inferred from the context of the video frames but are not identifiable in the single image. Such labels should be ignored if YOLOv3 is used to detect mobile machines.

Moreover, instances that are difficult to spot should be excluded for YOLO and Faster RCNN. Fig. 6.7 depicts annotations of unrecognizable trucks and persons, which are marked as difficult and need to be removed for YOLO and Faster RCNN [233]. For other algorithms, in contrast, these marks may be useful.

In this section, MOMA, a dedicated dataset for the CNN-based visual perception of mobile machines, was created, and the necessary preprocessing was introduced. Instead of using my dataset to train the predictors from scratch, I use Darknet-53 pretrained on ImageNet as the base network.

## **6.5 Evaluation of the Recent Computer Vision Algorithm Performance on the MOMA Dataset**

I would like to encourage engineers in the construction machinery industry to adopt similar ideas and take advantage of computer vision technologies in their applications. Here, I evaluate the effectiveness of the vision-based safety system and show the model setup. Since many predictors for mobile machines have already been built with Faster RCNN, I do not show its setup again to avoid redundancy and only demonstrate the implementation of YOLOv3.

Object detection tasks demand high computational power, and some practical cases, such as video stream recognition, require powerful computing devices. Since Google offers a GPU computing platform on which both models can be trained much faster than on a typical laptop, I tested my monitoring system, which was trained on Google Colab, where the GPU is free to use. This shows that a dedicated high-performance computer is not necessary. Nvidia GPUs accelerate the computation by taking advantage of CUDA. Since the mAP differs by no more than 1% between different GPU series, even with different image scales or mini-batch sizes, the trained weights can be used on other platforms. All environment settings, including the GPU, are defined in the configuration files. Faster-RCNN reaches about 5 Frames Per Second (FPS), while YOLOv3 reaches about 45 FPS on the Tesla K80 that Google offers for free. For comparison, an Nvidia GTX 1050 in a mid-range laptop achieves about 10 FPS with YOLOv3.

The original YOLO algorithm was published by Joseph Redmon on his website. Afterward, several revised versions appeared in different programming languages, updated in quite different aspects. In my work, the original version is mostly applied. However, to make use of some extended features, another popular repository is used as well; concretely, I use the version from Bochkovskiy, whose code can be found on his GitHub<sup>2</sup>.

Figure 6.8: Prediction samples with optimized dataset and algorithm YOLOv3.

<sup>2</sup> https://github.com/AlexeyAB/darknet/tree/darknet\_yolo\_v3\_optimal

Ideally, for each category to be detected, the training set should contain at least one similar object that resembles the targets in shape, relative size, point of view, tilt, illumination, etc. From that perspective, the larger the dataset, the better the detectors will be. However, training on a large dataset with the default settings in the configuration file may take a long time, which might discourage construction machine engineers from using computer vision algorithms. Also, even with SOTA solutions, the best mAP on MS COCO with an IoU of 0.5 is about 50%. Since my dataset is easier than MS COCO, the test results reach 85%. However, this is surely unacceptable for the construction machinery industry for safety reasons; it seems that those SOTA solutions need to be improved for the detection of construction machines, just as for the detection of cars. Alternatively, the autonomous driving of construction machines is a level four task, which makes it possible to increase the prediction performance by sacrificing generalization capability: the performance of the predictor on a specific working site with only a limited number of machine types is more important than its ability to detect all the mobile machines in the world. Thus, I focus in the following on finding a currently feasible solution for the construction machinery industry. Generally speaking, a predictor performs best if the training, validation, and test datasets follow the same distribution. Besides, the mAP of the predictor can increase when I add similar objects from different scenarios to the training data. Therefore, I recommend adding annotated images of the target mobile machines to the base dataset and training the model further to obtain the optimal predictor for the level four detection task. To validate the idea, I feed 666 well-annotated images from the MOMA dataset into the network for training and validation. The basic idea of this approach is to increase the recognition rate for the target mobile machines by adding samples of the machines to be detected to a relatively small dataset, which reduces the difficulty of detection. This approach is based on the assumption that no unexpected mobile machines will enter the working site. Obviously, for a closed working site, this assumption is reasonable.
The training time is dramatically reduced, and the prediction results are illustrated in Fig. 6.8. The selected ground truth instances are plotted in the histogram in Fig. 6.9.

On the images in Fig. 6.10, every single inference is marked with a bounding box whose color specifies its category. The category is labeled at the top left corner of each bounding box. The model appears to perform well on those images since they show a canonical view and are thus not so challenging.

Figure 6.9: Class distribution on 666 images: the instances of truck and excavator outnumber those of bulldozer and car since they attract more interest.

Figure 6.10: Sample of inference results by the 8,000th predictor on images in iconic view.

Although the predictor with the default configuration can easily achieve excellent accuracy on iconic images, its performance is not satisfying on images with a non-iconic view, i.e., images that are not taken from a normal perspective or in which objects are truncated or blocked by other objects. An image can also become non-canonical when the whole image is obscured or ambiguous, or when targets such as excavators are working in unusual conditions, e.g., standing in water. With the default setting of YOLOv3, the optimal performance may not be achieved. To address this problem, I share some useful tricks to improve the training process and the mAP of the YOLOv3 algorithm.

First of all, a comparison of the results in Fig. 6.11 and Fig. 6.12 shows that a higher mAP can be achieved with a relatively balanced training dataset, i.e., the number of instances per class should not differ too much.

Figure 6.11: mAP over batches, trained with a balanced dataset.

Second, YOLO applies multi-scale prediction on the feature maps. To reduce the computational effort without hurting the prediction performance, k-means was implemented [233] to cluster the labeled bounding boxes of all objects and obtain anchor centroids. Instead of using the default anchors for the COCO dataset, I generate nine anchors: (16.0,26.0), (40.0,40.2), (30.8,84.4), (71.8,84.2), (119.6,124.2), (105.0,219.0), (191.6,175.2), (200.0,290.6), (322.6,346.6).
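
The anchor clustering can be sketched in plain Python as follows. This is not the implementation from [233]: the use of 1 − IoU on (width, height) pairs as the distance follows common YOLO practice and is an assumption here, as are the deterministic initialization and the function names.

```python
def wh_iou(box, centroid):
    """IoU of two boxes given only as (width, height), aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    return inter / (box[0] * box[1] + centroid[0] * centroid[1] - inter)


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs in pixels into k anchors, sorted by area."""
    boxes = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic start: boxes evenly spaced over the area-sorted list
    centroids = [boxes[(2 * i + 1) * len(boxes) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU corresponds to the smallest distance 1 - IoU
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        new_centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda a: a[0] * a[1])


# Toy example with two obvious size groups; real input would be all labeled boxes,
# with normalized sizes first multiplied by the image size to obtain pixels
toy_boxes = [(10, 10), (12, 12), (11, 9), (100, 100), (110, 90), (95, 105)]
anchors = kmeans_anchors(toy_boxes, k=2)
```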

Figure 6.12: mAP over batches, trained with an unbalanced dataset.

Third, following the expectation that more training batches with smaller learning rates improve detection performance, I decay the learning rate once the average loss begins to fluctuate. Concretely, I set the learning rate schedule as follows:

```
steps=8000,10500,12000
scales=.5,.1,.1
```
Here, step learning rate decay factors of 0.5, 0.1, and 0.1 are applied at the 8,000th, 10,500th, and 12,000th step, respectively. Usually, 2,000 batches per class, and no fewer than 4,000 iterations in total, are sufficient before training can be stopped. Training can also be stopped when the average loss no longer decreases. After 12,000 training steps, the average loss converges to no more than 0.1, which is an adequate condition to stop.
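
The effect of this schedule can be illustrated with a short sketch. It assumes that the scales are applied cumulatively at each step, as is commonly understood for this kind of step policy, and it uses a base learning rate of 0.001, which is an assumption since the value is not stated here; any warm-up phase is ignored.

```python
def learning_rate(batch, base_lr=0.001,
                  steps=(8000, 10500, 12000), scales=(0.5, 0.1, 0.1)):
    """Learning rate at a given batch number under a cumulative step decay."""
    rate = base_lr
    for step, scale in zip(steps, scales):
        if batch < step:
            break
        # Each passed step multiplies the current rate by its scale
        rate *= scale
    return rate


schedule = [(b, learning_rate(b)) for b in (0, 8000, 10500, 12000)]
```

Under these assumptions, the rate halves at batch 8,000 and then drops by a factor of ten at each of the two later steps.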

With the new predictor, I run inference on the non-iconic images shown in Fig. 6.14. Mobile machines like excavators are usually large, and predictors may quickly get used to that dimension; however, if the target excavators are zoomed out or seen from an irregular perspective, they can appear as small objects as well. Based on my experiments, the classification and localization ability of the predictor on the first three images has been greatly enhanced compared to the predictor with the default setting, which only found the large objects in the middle of the figures. It remains blind to the excavator in the last image of Fig. 6.14. Regrettably, nothing is found even though human observers can easily discern the mobile machine (an excavator) on the left. Possible reasons are its lack of distinctive characteristics and its tiny size. More images of this size and pose should be added to improve the identification capability. Although some instances are still not detected, the overall performance of the predictor is satisfying since objects at shorter range deserve more attention.

Figure 6.13: Hierarchy of the predictor with skip connections, e.g., the 94th layer, responsible for detecting medium-size targets, relates to the 61st layer before downsampling. Likewise, the 36th layer is directly connected to the 91st layer by a shortcut.

Further differences among the three predictors at the 1,900th, 8,000th, and 12,000th batch are shown in the 2 × 4 image grid of Fig. 6.15: a and b are the raw images, a1 and b1 are predicted by the first predictor at the 1,900th batch, a2 and b2 by the second predictor at the 8,000th batch, and a3 and b3, at the bottom, by the third predictor at the 12,000th batch. Evidently, the bounding box surrounds the "dumper" more tightly as the number of training steps increases, indicating that the predictor at 12,000 batches has acquired a stronger ability to localize the targets. Besides that, the dumper, which is in the blue bounding box, is recognized, and the spurious truck detection is eliminated, which implies that the class confidence increases as the predictor is trained longer.

Figure 6.14: Inference by predictor at the 12,000th batch made on images with a non-iconic view.

Fig. 6.16 shows the AP values of each class for the predictors under the different situations. The trend illustrates that the AP increases with more batches for most classes. Here, the test data is quite similar to the validation data; hence, the predictors may overfit to the mobile machines that exist in the training and validation data. Although these results exaggerate the algorithm's real ability, they accurately reflect its performance for level four autonomous driving.

Although showing the generalization capability of the predictors may seem unnecessary, since the assumption that no unexpected mobile machines are on the working site is reasonable, I further tested the predictor at 8,000 batches on the other 5,663 images in the MOMA to show the performance of my method if the level 4 condition does not hold. From Fig. 6.17, it can be seen that the AP of each class is much lower. The classes person and car are the lowest, which is to be expected since I have fewer samples of these two classes. As a countermeasure, samples from other datasets can be added to the base dataset, so this does not pose a problem. The other large gap concerns the wheel loader. By analyzing the precision-recall curve, I found that the number of false positives increases dramatically as the confidence threshold decreases, resulting in an extremely low AP. I further analyzed the falsely detected samples and found that most mistakes are excavators with a forward-facing shovel, as they are reconfigured for mines, or trucks so close to the camera that their wheels appear extremely large. These features are not included in the small subset of the MOMA; since they are typical of a wheel loader, they lead the model to believe that it encounters a wheel loader. Based on this analysis, I added some mispredicted samples to the training and validation set, and the AP of the wheel loader then increased to 0.6. In this way, I rely on a minimal dataset to achieve good results on a specific site, though overfitting occurs.

Figure 6.15: Prediction performance contrast by three predictors, which were made on images with both iconic view and non-iconic view.

Figure 6.16: Individual prediction AP for each class made by predictors at batch 1,900 in blue and 8,000 in orange. The green column demonstrates the performance under the assumption that only known mobile machines are working on the working site, i.e., the level 4 autonomous driving case.

Figure 6.17: Individual prediction AP for each class on the other 5,663 images.

### **6.6 Conclusion**

In this work, I validated the feasibility of a visual monitoring system that increases the safety of participants on a closed working site. To create the monitoring system, I built the MOMA dataset, a large-scale and diverse dataset for construction machine detection with ground truth labels. Most of the images were captured in real scenarios on working sites, while some others were downloaded directly from the official websites of construction machine companies. Instead of gathering the images from the driver's view, I collected the samples from views outside the mobile machines since I believe this is more in line with the actual situation of autonomous construction machines. With my dataset, YOLOv3 can detect mobile machines with an mAP of 85% (Fig. 6.17) in general, which is much better than previous works that do not use deep learning algorithms. Note that I only compared against researchers who have published their code. Also, without considering instances from outside the specific working site, the mAP rises to almost 90.7% (Fig. 6.16), which indicates that the predictor is ready for a level four autonomous driving task. Since YOLO is better suited to real-time applications, I recommend adopting this algorithm for the recognition of construction machines. Finally, recognition performance depends on the dataset quality and on how the algorithms are trained. By further expanding the data collection and annotation, more satisfying results can be expected. Hence, I also recommend adding images of interest, such as the excavators or dumpers to be detected, to my MOMA dataset and further training the pretrained model to obtain a predictor that is best suited to the specific level four task.

# **7 Wireless Communication System<sup>1</sup>**

The fleet management of mobile working machines with the help of connectivity can increase not only safety but also productivity. However, few commercial mobile working machines have taken advantage of V2X communication so far. Current mature wireless communication technologies can be roughly divided into ad-hoc networks and cellular networks. In this chapter, I suggest that both IEEE 802.11p and 5G should be implemented for fleet management. In the first part, I propose an analytical model with which machines can estimate the ad-hoc network performance, i.e., the delay and the packet loss probability, in real time, based on simulation results I obtained in ns-3. This model can further be used to determine when an ad-hoc or a cellular network should be used in the corresponding scenarios. Afterward, I demonstrate the scenarios in which 5G can have a significant effect on the construction machinery industry. Also, based on simulations in ns-3, I compare the performance of 4G and 5G for the most relevant construction machine scenarios. Last but not least, I show the feasibility of remote-controlled and self-working construction machines with the help of 5G.

<sup>1</sup> Except for some minor modifications, e.g., the sequence of the text, all the figures, text, and results presented in this chapter have been published in my publications [40, 41]. My contributions to these papers are summarized as 90% and 80% in terms of conception and methodology, 90% and 90% of literature review, 80% and 40% of coding, 80% and 40% of results visualization, and 95% and 95% of formulation, respectively.

## **7.1 Introduction**

Besides artificial intelligence [63], the fleet management of mobile machines is the principal research direction of the IoT in the field of mobile working machinery. Currently, mobile machines are distributed sparsely on the working site and move at low transport speeds to avoid collisions. With vehicle-to-everything (V2X) communication, information about the current position, speed, or even destination and task is exchanged periodically between individual mobile machines. Since the intentions of neighboring mobile machines within sensing range are known, the machines can work more densely and transport material more efficiently. The most challenging and research-worthy use case is the repair of a highway, during which traffic congestion is usually expected. According to the study by Triantis, traffic congestion causes significant economic losses [234]. Deploying more machines on a particular site with the help of V2X technology can improve productivity, so that the economic loss due to the congestion can be reduced. Assuming that all or part of the vehicles are equipped with V2X, a high channel load occurs in the V2X network during traffic congestion. Thus, the V2X performance decreases, which manifests itself in larger delay and packet loss probability. In this chapter, I first evaluate the performance of the IEEE 802.11p standard for varying node densities by means of simulations using ns-3<sup>2</sup> [235]. Since the simulation model is computationally expensive, I then propose a fast estimation model for mobile machines to predict the mean delay and packet loss probability of the IEEE 802.11p-based V2X network.

Fig. 7.1 illustrates the benefits of the implementation of V2X technology on mobile machines.

<sup>2</sup> ns-3 is one of the most widely used software tools for network simulation. It is open-source, scalable, and actively developed by the scientific community. Moreover, its documentation is excellent. Its popularity and flexibility made me select it as the tool for my research. Other frequently mentioned network simulators are OMNeT++, SWANS, NetSim, and QualNet.

Figure 7.1: Comparison of the working site with and without V2X: more mobile working machines on the site, much higher productivity.

## **7.2 Current Wireless Communication for V2X**

### **7.2.1 Ad-hoc Networks**

Time-efficient and reliable message exchange among vehicles has been a longstanding issue for Intelligent Transportation Systems (ITS), which aim at enhancing driving safety as well as fulfilling the requirements of infotainment services. Currently, there are two commonly used technologies for V2X: IEEE 802.11p and 3GPP Cellular-V2X [236]. IEEE 802.11p is the first standard for vehicular communication [237]. Both ITS-G5 and Wireless Access in Vehicular Environments (WAVE), proposed by the EU and the US, respectively, amend the IEEE 802.11 standard for vehicular use [238].

In the last two decades, the tremendous evolution of wireless communication techniques has paved the way for the realization of ITS. In 1999, 75 MHz of free but licensed spectrum at 5.850-5.925 GHz was allocated by the US Federal Communications Commission (FCC) for Dedicated Short Range Communications (DSRC), exclusively for vehicle-to-vehicle and vehicle-to-infrastructure communications. In the US, the spectrum is divided into seven 10 MHz channels: six Service Channels (SCHs) and one Control Channel (CCH). In comparison, the European Union (EU) introduced five channels (5.875-5.925 GHz), where the CCH is restricted to safety usage only [239], i.e., to the Cooperative Awareness Message (CAM). The CAM is a periodic broadcast message that contains safety-relevant information such as position, speed, and acceleration. At the time of writing this thesis, the latest version of IEEE 802.11p is the one published in 2010 [238]. IEEE 802.11p is an ad-hoc network with a mesh topology and thus has shortcomings such as a short communication range, support for only medium mobility, and channel contention. The coverage of IEEE 802.11p mainly depends on the transmit power [240], path loss, signal fading, delay spread, Doppler spread, and angular spread [241]. The delay is unbounded because of Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) [242].

### **7.2.2 Cellular Networks**

In comparison with the WLAN-based IEEE 802.11p, C-V2X uses cellular networks, and thus the communication relies on base stations. C-V2X uses 3GPP-standardized 4G Long Term Evolution (LTE) or 5G mobile cellular connectivity [243]. As Vukadinovic pointed out, C-V2X is a developing technology that has evolved from 3G to 5G [244]. With a supervised star topology, the collision of messages is avoided. However, an obvious shortcoming of cellular networks is the relatively high delay even under low channel load, caused by the round trip between the transceiver nodes and the base station. In Release 14, 3GPP introduced direct Vehicle-to-Vehicle (V2V) communication outside of coverage under LTE-V mode 4 [245]. However, the distributed scheduling of LTE-V mode 4 cannot, in principle, avoid collisions entirely. To the best of the author's knowledge, the congestion avoidance mechanism from 3GPP does not outperform IEEE 802.11p.

The year 2020 is considered the first year of the 5G era in the wireless community since 5G was commercially deployed in that year. To date, 5G is still a fast-developing research subject; thus, opposing views coexist. To avoid exaggerating the capabilities of 5G technology, I only take into account parameters and data that more than half of the community agrees with. Despite some controversy, I follow Dahlman [246] and do not distinguish between 4G and LTE.

To overcome the shortcomings of 4G [247], the basic requirements for 5G were formulated in [248, 249, 250, 251, 252]: higher transmission rate, shorter latency, higher reliability, and more User Equipment (UE) connections. Correspondingly, three major concepts were proposed: enhanced Mobile Broadband (eMBB), Ultra Reliable Low Latency Communications (URLLC), and massive Machine Type Communications (mMTC) [253]. According to the 3rd Generation Partnership Project (3GPP) 38.101 agreement [254], 5G New Radio (NR) mainly uses two frequency bands: FR1 and FR2. The FR1 band ranges from 450 MHz to 6 GHz and is also called the sub 6 GHz band; the FR2 band ranges from 24.25 GHz to 52.6 GHz and is usually called the millimeter wave (mmWave) band. Currently, the most influential providers in the field of 5G are Huawei for the sub 6 GHz band and Qualcomm for the mmWave band. Other frequently mentioned competitors are Samsung, Ericsson, Datang, Nokia, Telecom, Intel, and ZTE. The higher the frequency, the closer the propagation characteristics are to those of light: the signal travels in a straight line, so obstacles can easily block it. Also, the energy loss increases dramatically with the propagation distance and proportionally to the square of the frequency. Consequently, the nature of mmWave leads to a coverage problem, which restricts the adoption of the high-frequency 5G spectrum. For this reason, most countries, such as China, Japan, and Korea, give priority to the sub 6 GHz band since its coverage is much larger, and thus more people can benefit from 5G technology. Compared to 4G, which only has 20 MHz of channel bandwidth, 5G is allocated about 100 MHz in the sub 6 GHz range.
Moreover, thanks to novel Multiple-Input Multiple-Output (MIMO) technology, more antennas are used simultaneously to achieve a much higher transmission rate than 4G. Whereas 4G handsets only have 2×2 or 4×4 antennas, 5G base stations and UEs use antenna arrays to increase the spectrum utilization [255, 256]. However, since such 5G UEs also use the sub-6 GHz band, they do not differ from 4G in principle, and some serious problems are inevitable. First of all, because the sub-6 GHz region is also used by 2G, 3G, and 4G and is thus already very crowded, a further increase of the bandwidth is almost impossible. Although some network operators reassign channel bandwidth that previously belonged to 2G and 3G to 5G, the bandwidth will not be sufficient for potential future requirements. In addition, the antenna configuration depends on the signal frequency. At sub-6 GHz frequencies, the wavelength is more than 1 cm, so the number of antennas that fit into a UE is limited. Therefore, soon after sub-6 GHz was promoted, the use of the higher FR2 frequencies, i.e., 28 GHz and above, became a hot topic. Compared to the sub-6 GHz region, a channel bandwidth of 1 GHz is easy to obtain in the FR2 region, so the transmission rate is expected to be much higher. In the mmWave range, the available spectrum bandwidth reaches 1 GHz in the 28 GHz band, for example, while the available signal bandwidth of each channel in the 60 GHz band is 2 GHz [254]. At constant spectrum utilization, doubling the bandwidth directly doubles the data transmission rate, which is possible in the mmWave band. Since 3GPP has decided to continue using Orthogonal Frequency Division Multiplexing (OFDM) for 5G NR [254], mmWave technology has become the biggest novelty of 5G.
Although mmWave is already used for satellite communication, it was long considered infeasible for everyday scenarios. Only recently have novel technologies unlocked the high-frequency spectrum. Concretely, thanks to antenna arrays, which consist of a large number of antennas, and beamforming [257], the energy can be concentrated in small regions. Moreover, because mmWave antennas can be designed much smaller than microwave antennas, an mmWave antenna array is much denser and holds more antennas in the same geometrical space. Together with a sufficient number of small-cell base stations, mmWave has come to the forefront of commercial applications. Other important 5G technologies, such as the new numerology and LDPC/Polar codes, allow OFDM to extend better to the mmWave band. To adapt to the large bandwidth of mmWave, 5G defines multiple sub-carrier spacings, of which the larger ones are specifically designed for mmWave, whereas the smaller ones ensure compatibility with systems from the 4G era. One of the main goals of 5G is to support URLLC services with stringent requirements for reliability and delay. LTE achieves a user-plane two-way wireless delay of less than 10 ms, and the design goal of 5G is to reduce this delay by at least a factor of five, that is, to less than 2 ms. According to 3GPP TS 38.211 [254], the 5G NR physical layer provides multiple sub-carrier spacing configurations [258]. Increasing the sub-carrier spacing shortens the OFDM symbols, which in turn shortens a single time slot and reduces the delay. As the 3GPP specification notes, the sub-carrier spacing is inversely proportional to the OFDM symbol duration, which is an inherent property of OFDM. Compared to current network communication technology, the key capability indicators of the 5G system have been greatly improved.
The information transmission delay of the 5G network can reach the millisecond level, which meets the stringent requirements of the network and guarantees the safety of the controlled UE. The peak rate of 5G can reach 10-20 Gbit/s, and the connection density can reach 1 million/km<sup>2</sup> [259]. Although technology can overcome the difficulties of implementing mmWave, mmWave base stations consume a lot of energy. Thus, the Heterogeneous Network (HetNet) is also essential in the 5G era, i.e., most scientists in the wireless community believe that networks below and above 6 GHz will coexist for a long time. Like LTE, 5G also provides device-to-device communication for the case that UEs are outside the coverage of base stations [260].
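The relation between sub-carrier spacing, OFDM symbol duration, and slot duration mentioned above can be made concrete with a short sketch. It assumes the scalable numerology of 3GPP TS 38.211, in which the sub-carrier spacing is 15 kHz · 2^µ and a slot of 14 symbols lasts 1 ms / 2^µ. The function names are illustrative, and the cyclic prefix is ignored in the useful symbol duration.

```python
# Sketch of the scalable 5G NR numerology (assumed from 3GPP TS 38.211).
# Function names are illustrative and not taken from the thesis.


def subcarrier_spacing_khz(mu: int) -> int:
    """Sub-carrier spacing for numerology index mu: 15 kHz * 2**mu."""
    return 15 * 2 ** mu


def useful_symbol_duration_us(mu: int) -> float:
    """Useful OFDM symbol duration T = 1 / delta_f in microseconds (no cyclic prefix)."""
    return 1e3 / subcarrier_spacing_khz(mu)


def slot_duration_ms(mu: int) -> float:
    """Duration of one slot (14 symbols, normal cyclic prefix): 1 ms / 2**mu."""
    return 1.0 / 2 ** mu


if __name__ == "__main__":
    for mu in range(5):
        print(
            f"mu={mu}: {subcarrier_spacing_khz(mu)} kHz, "
            f"symbol {useful_symbol_duration_us(mu):.2f} us, "
            f"slot {slot_duration_ms(mu):.4f} ms"
        )
```

Doubling the sub-carrier spacing halves both the symbol and the slot duration, which is the mechanism by which the larger spacings used for mmWave reduce the air-interface delay.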

## **7.3 IEEE 802.11p**

## **7.3.1 Why Do I Use IEEE 802.11p?**

Although LTE has a series of advantages, I adopt IEEE 802.11p for the first version of connected mobile machines for the following reasons. First, to fully exploit the advantages of C-V2X, mobile machines need a base station nearby, and the range of a base station varies from 10 m to 10 km [261]. A fleet of mobile machines working far away from urban areas might fail to find one. Second, the usage of 802.11p is free of charge. Unlike a cellular network, for which users must pay the network operators, the 5.9 GHz band is a free but licensed spectrum [237]. Third, IEEE 802.11p is designed for the vehicle industry, so no additional modification of the vehicle onboard ECU is needed [262]. Thus, for a mobile machine designed without V2X in mind, the compatibility of IEEE 802.11p is better than that of cellular networks. Fourth, mobile machines usually drive at a relatively low speed, and communication with other onboard units, for instance between passing cars and mobile machines, is not essential; thus, the weaker ability of IEEE 802.11p to deal with vehicle mobility, as analyzed by Alasmary [263], can be ignored. Although there is no consensus about which wireless technology is more promising, scientists from both sides agree that the combination of LTE and 802.11p performs better than either technology alone [236, 240, 262, 264]. Thus, I use IEEE 802.11p as the communication technology for the initial version of the fleet management. Even if the passenger car industry adopts cellular technology in the future, using IEEE 802.11p for mobile machines remains sensible, because the congestion of the channel is consequently alleviated.

#### **7.3.2 Modelling**

Mecklenbräuker has presented the common scenarios in [241]. Unfortunately, the scenario of mobile machines repairing a highway does not belong to these common ones. Firstly, there are usually no buildings around the working site, but the traffic is congested. Secondly, instead of evaluating the communication among all participants in the ad-hoc network, only the communication among mobile machines is essential.

#### **7.3.3 Propagation Model**

In [265], a comparative analysis of different propagation models is performed. According to Stoffer's study, no model is best for all cases, and users should select the model depending on the concrete use case. Because engineers are mainly interested in the delay and packet loss resulting from congestion control algorithms at the MAC layer, and because the highway is more similar to an urban scenario, I use the log-distance path loss model proposed in [266]. It is given by

$$PL\,[\mathrm{dB}] = PL(d_0) + 10 \cdot n \cdot \log_{10}\left(\frac{d}{d_0}\right) \tag{7.1}$$

where $PL(d_0)$ is the path loss at the reference distance $d_0$, here $PL(d_0) = 46.6777$ dB, and $n$ is the path loss exponent, which depends on the propagation environment, here $n = 3$.
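As a minimal sketch, Eq. 7.1 can be evaluated as follows. The reference loss and the exponent are the values given above; the reference distance of 1 m and the base-10 logarithm (usual for path loss in dB) are assumptions, since the text does not state them explicitly. The function names are illustrative.

```python
import math

PL_D0_DB = 46.6777  # path loss at the reference distance, from the text
EXPONENT = 3.0      # path loss exponent n, from the text
D0_M = 1.0          # ASSUMPTION: reference distance of 1 m (not stated in the text)


def path_loss_db(d_m: float, pl_d0_db: float = PL_D0_DB,
                 n: float = EXPONENT, d0_m: float = D0_M) -> float:
    """Log-distance path loss in dB at distance d_m (Eq. 7.1, base-10 logarithm)."""
    if d_m <= 0:
        raise ValueError("distance must be positive")
    return pl_d0_db + 10.0 * n * math.log10(d_m / d0_m)


def received_power_dbm(tx_power_dbm: float, d_m: float) -> float:
    """Received power if path loss is the only effect (no fading, no antenna gains)."""
    return tx_power_dbm - path_loss_db(d_m)


if __name__ == "__main__":
    for d in (1, 10, 100):
        print(f"{d:>4} m: {path_loss_db(d):.4f} dB")
```

With $n = 3$, every tenfold increase in distance adds 30 dB of loss, which is why the distance alone determines the received power in this model.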

Since the distance from the transmitter is the only factor that influences the received power, no dynamic mobility model is applied to the vehicles in the following simulations. The relative positions of the vehicles are, however, randomly initialized.

### **7.3.4 CAM Generation Model**

Venel showed that CAMs are generated at a rate of 2 to 20 packets/second, depending on multiple factors such as the driver's reaction time and the vehicle speed [267]. I therefore apply an intermediate value, namely 10 packets/second (10 Hz). In addition, the packet length varies between applications in real-world vehicular communications. In the following simulations, the packet length is set to 450 bytes, which is sufficient to carry the necessary information for safety-related applications<sup>3</sup>. Since the generation rate and the CAM length are constant throughout the simulation, the channel load depends only on the number of nodes in the scenario.
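The dependence of the channel load on the number of nodes can be illustrated with a short calculation. The sketch below only counts CAM payload bits and ignores PHY and MAC overhead, which is a simplifying assumption; the function name is illustrative.

```python
CAM_RATE_HZ = 10        # CAM generation rate, from the text
CAM_LENGTH_BYTES = 450  # CAM length, from the text


def offered_load_bps(n_nodes: int, rate_hz: int = CAM_RATE_HZ,
                     length_bytes: int = CAM_LENGTH_BYTES) -> int:
    """Offered payload load in bit/s; ASSUMPTION: headers and MAC overhead are ignored."""
    return n_nodes * rate_hz * length_bytes * 8


if __name__ == "__main__":
    for n in (1, 40, 80):
        print(f"{n:>3} nodes: {offered_load_bps(n) / 1e6:.3f} Mbit/s")
```

Each node contributes 36 kbit/s of payload, so the offered load grows linearly with the number of nodes.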

## **7.3.5 CSMA/CA and Enhanced DCF Channel Access (EDCA)**

The CSMA/CA algorithm is specified in IEEE 802.11 to schedule transmissions over a single channel by deferring the access attempt by a random back-off time. In addition, EDCA introduces Interframe Spaces (IFS) and different contention window sizes to prioritize access categories and to improve the quality of service (QoS) [269].

Since the primary emphasis of this chapter is on the congestion control algorithms at the MAC layer and the CAM length is constant, the term delay in the following always refers to the back-off time between the moment a node requests channel access and the moment the packet is forwarded from the MAC layer to the PHY layer. The transmission time, which depends on the packet length, and the propagation time, which depends on the distance, are neglected. Tab. 7.1 contains the vital parameter settings that I use.
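To make this notion of delay more tangible, the following sketch computes a simplified channel access delay. It is not the ns-3 implementation used for the simulations. The slot time of 13 µs and the SIFS of 32 µs are assumptions (values commonly quoted for 802.11p on a 10 MHz channel), the AIFSN and contention window are inputs whose actual values belong to Tab. 7.1, and counter freezing during further busy periods is ignored.

```python
import random

SLOT_US = 13  # ASSUMPTION: slot time commonly quoted for 802.11p (10 MHz channel)
SIFS_US = 32  # ASSUMPTION: SIFS commonly quoted for 802.11p (10 MHz channel)


def aifs_us(aifsn: int, sifs_us: int = SIFS_US, slot_us: int = SLOT_US) -> int:
    """Arbitration inter-frame space: SIFS + AIFSN * slot time."""
    return sifs_us + aifsn * slot_us


def draw_backoff_us(cw: int, rng: random.Random, slot_us: int = SLOT_US) -> int:
    """Random back-off: a uniform integer number of slots in [0, cw]."""
    return rng.randint(0, cw) * slot_us


def access_delay_us(busy_remaining_us: int, aifsn: int, cw: int,
                    rng: random.Random) -> int:
    """Simplified MAC access delay of one packet.

    If the channel is idle at the request, the packet is sent without delay.
    Otherwise the node waits for the rest of the busy period, then AIFS, then
    a random back-off (further interruptions are not modelled).
    """
    if busy_remaining_us == 0:
        return 0
    return busy_remaining_us + aifs_us(aifsn) + draw_backoff_us(cw, rng)


if __name__ == "__main__":
    rng = random.Random(1)
    print(access_delay_us(busy_remaining_us=200, aifsn=2, cw=15, rng=rng))
```

A smaller AIFSN and a smaller contention window shorten the expected delay, which is how EDCA prioritizes access categories.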

<sup>3</sup> According to the Survey on ITS-G5 CAM Statistics by the CAR 2 CAR Communication Consortium, the CAM size is a design parameter, and 30% of the messages are larger than 450 bytes. The size of typical V2X messages falls within the range of 60-800 bytes [268].


Table 7.1: Simulation parameters

Each transmitter has two ranges, a transmission range and a sensing range, since the CAM header and payload are modulated with different schemes and therefore have different immunity against noise and channel fading. The Physical Layer Convergence Protocol (PLCP) header is modulated with Binary Phase Shift Keying (BPSK) [270], and the payload is transmitted with Quadrature Phase Shift Keying (QPSK). I repeated an experiment 200 times, each time gradually increasing the distance between the transmitter and the receiver. The simulation results show that the transmission range is 115 m, corresponding to an SINR of 6.49825 dB, and the sensing range is 175 m. That is, beyond 115 m the receiver cannot decode the content of the messages, and beyond 175 m it cannot even detect the headers. Two transmitters that are more than 175 m apart can send packets simultaneously, unaware that the channel is busy. In this case, as shown in Fig. 7.2, they are called hidden nodes. Their packets may collide at receivers that are within reach of both hidden nodes, and the mutual interference results in transmission failures.

Figure 7.2: The hidden node problem.
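The two ranges translate into a simple classification of node pairs, sketched below. The thresholds of 115 m and 175 m are the simulation results stated above; the one-dimensional positions along the road and the function names are illustrative assumptions.

```python
TRANSMISSION_RANGE_M = 115.0  # from the text: payload can be decoded up to here
SENSING_RANGE_M = 175.0       # from the text: header can be detected up to here


def link_state(distance_m: float) -> str:
    """Classify a link by distance: 'decodable', 'sensed', or 'hidden'."""
    if distance_m <= TRANSMISSION_RANGE_M:
        return "decodable"
    if distance_m <= SENSING_RANGE_M:
        return "sensed"
    return "hidden"


def hidden_node_collision_possible(tx_a_m: float, tx_b_m: float, rx_m: float) -> bool:
    """True if two transmitters are hidden from each other but both reach the receiver.

    Positions are one-dimensional coordinates along the road (an assumption).
    """
    hidden = link_state(abs(tx_a_m - tx_b_m)) == "hidden"
    both_reach_rx = (link_state(abs(rx_m - tx_a_m)) == "decodable"
                     and link_state(abs(rx_m - tx_b_m)) == "decodable")
    return hidden and both_reach_rx


if __name__ == "__main__":
    print(link_state(60.0), link_state(140.0), link_state(200.0))
    print(hidden_node_collision_possible(0.0, 200.0, 100.0))
```

In the example, two transmitters 200 m apart cannot sense each other, yet a receiver halfway between them is within 115 m of both, so their packets can collide there.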

In short, I analyze a working site on the highway and evaluate the communication performance among mobile machines under the interference from nearby cars.

### **7.3.6 Evaluation of Hidden Node Problem**

To evaluate the impact of the hidden node problem on the vehicle network, I consider the following set of simulations. The transmitter and the receiver, i.e., the dumper and the excavator in the figure, are placed very close to each other. A total of 80 neighbor nodes is equally divided into two groups, which are symmetrically distributed on both sides of the transmitter/receiver pair (Tx/Rx). Concretely, twelve simulations are executed, in which the distance between the two groups of neighbors increases from 0 to 220 m in steps of 20 m, and each node sends 300 CAMs. The most critical question is whether the receiver can obtain the information sent by the transmitter under the given distribution of neighbor nodes. The simulation setup for testing the hidden node problem is shown in Fig. 7.3.

Figure 7.3: Schematic view of simulation scenarios.

Fig. 7.4, Fig. 7.5, and Fig. 7.6 show how the distance between the two neighbor groups affects the mean delay, the packet collision probability, and the packet loss probability of the transmitter and the neighbors, respectively. The performance observed at the transmitter and at the neighbors is plotted as a blue and a red curve, respectively. As a reference, the yellow and green dotted lines indicate the simulation results in which 40 and 80 neighbors are located at the same position as the transmitter.

With respect to the mean delay in Fig. 7.4, the curve for the neighbors remains stable within 115 m and then rises in the sensing range owing to the additional Extended Inter-Frame Space (EIFS) appended to the Arbitration Inter-Frame Space (AIFS). Finally, it drops significantly when the two groups are more than 175 m apart. In this case, they are hidden from each other, so the delay in each group approximates that of the scenario with just 40 neighbor nodes in the transmission range. Meanwhile, the curve for the transmitter fluctuates slightly, because the mean delay of the transmitter is averaged over 300 packets, in contrast to 80 × 300 packets for the neighbors. The mean delay of the transmitter decreases when the two neighbor groups are in each other's sensing range, because the higher delay of the neighbors gives the transmitter a higher probability of accessing the channel. When the neighbors are hidden from each other, transmissions from hidden nodes overlap, and the total channel busy time decreases. As a result, the mean delay of the transmitter declines.

Figure 7.4: Mean delay [µs] versus distance between two groups of neighbor nodes [m].

Similarly, the packet collision probability, which depends solely on the number of nodes that can be sensed, is shown in Fig. 7.5. The red curve for the neighbors coincides with the 80-neighbor scenario and then drops rapidly to the 40-neighbor level as the two groups become hidden from each other. Meanwhile, the collision probability of the transmitter remains steady until the neighbors become hidden nodes. Since the overlapping transmissions free up idle channel time, as mentioned in the previous paragraph, the packet collision probability of the transmitter declines, as does its packet loss probability, which is shown in Fig. 7.6.

The overlapping packets from hidden nodes collide and are corrupted at the receiver, resulting in a dramatic growth of the packet loss probability of the neighbor nodes, which can be clearly seen from the red curve in Fig. 7.6. Meanwhile, the transmitter experiences fewer collisions. In brief, the transmitter benefits when the neighbor nodes form pairs of hidden nodes, in terms of a lower mean delay, packet collision probability, and packet loss probability.

Figure 7.5: Packet collision probability versus distance between two groups of neighbor nodes [m].

Figure 7.6: Packet loss probability versus distance between two groups of neighbor nodes [m].

The number of neighbors has a significant impact on the network performance, particularly when the packet length and the generation rate are fixed.

## **7.3.7 Empirical Model for Fast Estimation of Ad-hoc Network Performance**

Although ns-3 can simulate the V2X performance in terms of delay and packet loss probability, I still need a quick estimation method so that the onboard ECU can obtain the V2X performance in realtime and evaluate the plausibility of V2X data. Therefore, I build an empirical model that quickly estimates the network performance based on the results from ns-3. Since the contention behavior due to CSMA/CA in the corresponding ranges should follow the same rules, which highly depend on the number of neighbors, I introduce the analytical model as follows.

#### **7.3.7.1 LuT Generation**

For each cluster, i.e., the area within the transmission range and the area between the transmission and sensing ranges, I generate a Lookup Table (LuT) in advance, which contains a set of crucial performance indicators as a function of the number of neighbors. To reduce the effect of randomness, I average the results from a large number of CAM transmissions.

To generate the LuT for the first cluster, I execute the following simulations. All neighbors are located at the same position, 60 m away from the transmitter. The number of neighbors varies from 5 to 200 in steps of 5. For each of the resulting 40 scenarios, 5 simulations are conducted, in which every node schedules 1,000 transmissions. The same simulations are executed for the second LuT, except that the neighbors are 140 m away from the transmitter.

Four metrics of the transmitter are measured, as shown in Fig. 7.7, namely the collision probability ($P_c$), the packet delay probability ($P_d$), the packet loss probability ($P_l$), and the mean delay ($t_{md}$). The term collision indicates that the access attempt occurs while another node is transmitting. Moreover, the access attempt can also be deferred by the ongoing AIFS that follows the previous transmission, even though the channel is idle. Therefore, the percentage of delayed packets is slightly higher than the percentage of collisions. The packet delay probability and the mean delay indicate how probable it is that a packet is delayed due to access contention and, once a delay occurs, how long it lasts on average.

Figure 7.7: Packet delay probability, packet collision probability, packet loss probability, and mean delay, measured with a varying number of neighbors in the two clusters, are included in the LuT.

#### **7.3.7.2 Performance Estimation**

For each onboard unit in the scenario, the number of neighbors located in each of the two clusters is counted. The analytical result is the sum of two values that are interpolated from the LuTs. Furthermore, an analytical probability is capped at 1. Eq. 7.2 and Eq. 7.3 express this idea,

$$
\hat{\tilde{\Phi}}_{A,t,n} = LuT_{t,1}(n_T) + LuT_{t,2}(n_S) \tag{7.2}
$$

$$\hat{\tilde{\Phi}}_{A,p,n} = \min\{1, LuT_{p,1}(n_T) + LuT_{p,2}(n_S)\}\tag{7.3}$$

where $\hat{\tilde{\Phi}}_{A,n}$ is the naive estimation of the ad-hoc network performance using the analytical model, and the subscripts $t$ and $p$ denote the estimation in terms of time and probability, respectively. $n_T$ is the number of nodes inside the transmission range, and $n_S$ is the number of nodes inside the sensing range.
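A minimal sketch of this table lookup is given below. The LuT entries in the usage example are placeholders, not results from the thesis, and clamping outside the tabulated range is an assumption. The neighbor count for the sensing range is read here as the count of the second cluster, i.e., between the transmission and sensing ranges. All function names are illustrative.

```python
from bisect import bisect_right


def interpolate(table, n):
    """Linear interpolation in a LuT of sorted (number_of_neighbors, value) pairs.

    ASSUMPTION: queries outside the tabulated range are clamped to the border values.
    """
    xs = [x for x, _ in table]
    ys = [y for _, y in table]
    if n <= xs[0]:
        return ys[0]
    if n >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, n)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (n - x0) / (x1 - x0)


def estimate_time(lut_t1, lut_t2, n_t, n_s):
    """Eq. 7.2: time metrics of both clusters add up."""
    return interpolate(lut_t1, n_t) + interpolate(lut_t2, n_s)


def estimate_probability(lut_p1, lut_p2, n_t, n_s):
    """Eq. 7.3: probabilities add up but are capped at 1."""
    return min(1.0, interpolate(lut_p1, n_t) + interpolate(lut_p2, n_s))


if __name__ == "__main__":
    # Placeholder tables for illustration only (not data from the thesis).
    lut_p1 = [(5, 0.10), (10, 0.20), (15, 0.30)]
    lut_p2 = [(5, 0.05), (10, 0.15), (15, 0.25)]
    print(estimate_probability(lut_p1, lut_p2, n_t=7, n_s=12))
```

Because the lookup only needs two interpolations and one addition per metric, it is cheap enough to run on an onboard ECU in realtime.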

#### **7.3.8 Validation and Calibration**

In this section, I first validate the viability of the analytical model and then introduce the correction factor to eliminate the error between the naive LuT and the realistic simulation results.

In the validation simulation, the traffic scenario is a 1,500 m long highway with 3 lanes in each direction. 500 onboard units equipped with 802.11p devices are placed statically. Congested traffic due to a highway worksite is assumed. The total simulation time is 100 s, and the vehicles are randomly distributed on the road.

The delay-related metrics are simulated and estimated for all onboard units, because each transmission has a unique channel access time, which is independent of the reception. For each onboard unit, the packet loss probability is measured at a single receiver located randomly within 15 m, corresponding to two cooperating mobile machines.

Fig. 7.8 shows the correlation coefficients for each performance metric, which measure the strength of the association between the simulated and the analytical results. For an optimal fit, the blue dots should lie along the diagonal line, which corresponds to a correlation coefficient of 1. The correlation coefficients for the mean delay, the packet delay probability, and the packet loss probability are 0.9417, 0.9277, and 0.9167, which indicate a strong correlation and a satisfactory estimation ability of the analytical model.
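For reference, such a coefficient can be computed as follows. The sketch assumes that the Pearson correlation coefficient is meant, which the text does not state explicitly; the function name is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    analytical = [0.1, 0.2, 0.3, 0.4]
    simulated = [0.2, 0.4, 0.6, 0.8]  # placeholder values, twice the estimate
    print(pearson(analytical, simulated))
```

Note that a constant multiplicative bias, as in the placeholder example, leaves the coefficient at 1. A high correlation therefore shows that the estimate tracks the simulation but not that it is unbiased, which is the motivation for a separate correction factor.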

To improve the estimation performance of the proposed analytical model, a correction factor $\tilde{f}_c$ is introduced,

$$
\tilde{f}_c = \frac{\tilde{\Phi}_S}{\hat{\tilde{\Phi}}_A} \tag{7.4}
$$

where $\tilde{\Phi}_S$ and $\hat{\tilde{\Phi}}_A$ are the performance metrics from the simulation and from the analytical model, respectively, for $t_{md}$, $P_d$, and $P_l$.

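The goal is to find the correction factor that minimizes the squared error over all vehicles, as expressed in Eq. 7.5:

$$\min_{\tilde{f}_c} J = \min_{\tilde{f}_c} \sum_{n=1}^{N} \left(\tilde{f}_c \cdot \hat{\tilde{\Phi}}_{A,n} - \tilde{\Phi}_{S,n}\right)^2\tag{7.5}$$

where $N$ denotes the total number of vehicles.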
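For a single scalar factor, the least-squares problem of Eq. 7.5 has a closed-form solution: setting the derivative of J with respect to the factor to zero gives the sum of the products of analytical and simulated values divided by the sum of the squared analytical values. The sketch below implements this closed form. Whether the thesis used it or a numerical optimizer is not stated, and the function names are illustrative.

```python
def cost(f_c, analytical, simulated):
    """Sum of squared errors J of Eq. 7.5 for a given correction factor."""
    return sum((f_c * a - s) ** 2 for a, s in zip(analytical, simulated))


def correction_factor(analytical, simulated):
    """Closed-form least-squares minimizer of J: sum(a*s) / sum(a*a)."""
    if len(analytical) != len(simulated):
        raise ValueError("sequences must have equal length")
    denominator = sum(a * a for a in analytical)
    if denominator == 0:
        raise ValueError("analytical estimates must not all be zero")
    return sum(a * s for a, s in zip(analytical, simulated)) / denominator


if __name__ == "__main__":
    # Placeholder values for illustration only (not data from the thesis).
    analytical = [0.10, 0.20, 0.30]
    simulated = [0.16, 0.29, 0.46]
    f_c = correction_factor(analytical, simulated)
    print(f_c, cost(f_c, analytical, simulated))
```

The same function can be applied separately to the vehicles in the center and at the edge of the scenario to obtain one factor per region and metric.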

Figure 7.8: The correlation coefficients of the 3 metrics are close to 1, which indicates good feasibility of the analytical estimation. To increase the estimation accuracy, I introduce $\tilde{f}_c$.

The ratio $\tilde{\Phi}_S / \hat{\tilde{\Phi}}_A$ is shown in the bottom right sub-figure of Fig. 7.8. The three curves from top to bottom indicate $\tilde{f}_c$ for the mean delay, the packet delay probability, and the packet loss probability. The uniform color in the center area indicates that the naive analytical estimation method performs stably and can thus be adjusted by multiplying by an appropriate correction factor $\tilde{f}_c$. Among the 3 metrics, the packet loss probability is dramatically underestimated and needs a larger $\tilde{f}_c$. The reason is that, in the LuT generation scenario, a reception fails only when multiple transmitters attempt to access the channel simultaneously; hidden nodes are not considered. In the validation simulation, however, the transmissions from hidden nodes cause interference at the receiver. Consequently, the reception is more likely to be corrupted due to a lower SINR.

The correction factor differs at the discontinuous edge of the scenario, where the hidden node problem is less pronounced. For this case, I introduce another correction factor. Tab. 7.2 lists the correction factor in the center ($f_{c,c}$) and the correction factor at the edge ($f_{c,e}$), both calculated based on Eq. 7.5.


Table 7.2: Correction factors

After applying the correction factors, the analytical model outputs results very similar to those of the simulation model. Furthermore, the LuT is portable to scenarios with different PHY parameters and path loss models by recalculating the transmission and sensing ranges, since the contention mechanism due to CSMA/CA stays the same.

## **7.4 The Fifth-Generation Mobile Networks**

The fleet management of mobile machines is an interesting research direction of the Internet of Things (IoT) in the construction machinery industry. Besides the ad-hoc network used as the first version for mobile machines [40], 5G attracts huge attention because it is expected to achieve even higher-quality communication. As mentioned earlier in this chapter, WiFi technology can accomplish realtime communication among mobile machines so that they can work closer together and faster. As a consequence, engineers can increase productivity and reduce the duration of construction projects. This is meaningful for repair projects on highways, mining projects, and transportation in harbors. Since mobile machines usually work surrounded by dust, to which Lidars are quite sensitive, cameras are a more robust and promising approach towards self-working machines or the remote control of mobile construction machines. As the video resolution increases, both image recognition algorithms and humans can acquire information more easily and more accurately. However, the capacity of WiFi technology, especially its uplink capacity, limits the introduction of wireless HD video transmission for construction machines. As I did not find comprehensive research indicating how 5G can change the mobile construction machinery industry, I first analyze the potential use cases of 5G for this industry in this chapter. I then illustrate the benefits of 5G with my simulation results obtained with ns-3 [235]. Finally, I outline the blueprint of future smart working sites based on the simulation results. Fig. 7.9 and Fig. 7.10 show the potential use cases of 5G in the field of mobile construction machines.

#### **7.4.1 Where Can Working Sites Benefit from 5G?**

According to GSMA's 2020 outlook, mmWave could generate economic benefits of roughly 212 billion dollars in the Asia Pacific region alone in 2034. Of this amount, 3% to 9% will come from the agriculture and mining industries.

Figure 7.9: Remote control with live streaming: here cameras will be installed on the mobile machines while the driver sits in a comfortable room to operate the machines remotely. Thanks to 5G, HD video streaming can be sent with low delay and high reliability.

To date, 5G mmWave needs a lot of micro base stations, which are also energy-consuming [272] and costly. Moreover, the shortcomings of mmWave are amplified by the harsh environment on the working site, such as blockage by dust and giant machines. However, this has not stopped engineers from adopting the new technology in the construction field. Currently, most people believe that IoT technologies will endow the mobile construction machinery industry with abilities such as predictive maintenance, data analytics, and visualization and notification. Besides these, other scenarios are remote control and self-working mobile machines, which previous communication technologies cannot handle well. In some dangerous traditional industries, such as remote maintenance of underground pipelines, remote rescue after landslides, and underground mine excavation, the operating environment is hazardous and harmful to the human body. Although remote control is achieved with a wired network in today's projects, the flexibility is limited by the cable connected to the vehicle, so remote control is only used in particular cases. Thanks to 5G, remote control can be performed without the limitation of cables, so 5G accelerates its adoption. In this case, the cameras are usually installed on the machines to collect information about the surrounding environment [273, 274, 275]. Since more than three cameras are typically needed to obtain this information and the transmission rate of WiFi is limited, no further cameras can be installed to provide depth information, which results in lower productivity even with the very best operators [276]. Considering that virtual reality technology will be adopted with 5G, the difficulty of remote control will be dramatically reduced. Better than earlier network technologies, 5G guarantees the efficiency and accuracy of remote control. Another major expected application is self-working machines. In cooperation with deep learning-based image processing models [62], the images can be further processed on the local cloud, and the commands can then be sent directly to the machines. To avoid additional cost, many scientists point out that a smartphone can be used as an intermediary to transmit information instead of installing additional equipment [64].

Figure 7.10: Self-working mobile machines: here, cameras are fixed on the ground instead of being installed on the machines to avoid obstruction of the view. The stream is uploaded to the command center and processed in the cloud. Based on the streams from more than two cameras, the depth information and the motion of the machines can be acquired. Afterward, the command signal is sent directly to the machines. Research on instance segmentation of construction machines can be found in [271].

Although 5G shows excellent progress compared to 4G and WiFi, end customers only accept a new technology if it brings a sudden, colossal improvement. I find that predictive maintenance, data analytics, and visualization and notification are actually nice-to-have technologies. Since 5G may need a lot of micro base stations, which are also energy-consuming [272], the value created by predictive maintenance can hardly compensate for the additional cost of 5G. In many cases, preparing some backup vehicles is a more effective and money-saving solution. Thus, I believe that remote control and self-working mobile machines are the more realistic scenarios, since here 5G achieves something engineers could not do well before.

In the above scenarios, three key technologies are required for remotely controlled or self-working construction machinery. The first is a high data transmission rate. To enable the AI or the human operator to fully understand the situation in realtime, the construction machinery is either observed by HD cameras or carries the cameras itself, and the video stream is delivered to the operator. The transmission of HD video requires a large bandwidth to ensure fluent and realtime transmission of the video content. The second is a low delay in receiving information. The realtime interaction between operators and the controlled construction machinery requires a low-latency network so that the controller's commands are executed by the actuators in realtime. The third is the rapid and convenient deployment of the communication network between the construction machinery and the operators. If a wired network is used between the construction machinery and the controller, the network delay and bandwidth can be guaranteed to a certain extent, but the cable limits the movement of the construction machinery, and the network cannot be deployed rapidly. If a 4G-LTE wireless cellular network is used, its limited transmission rate and delay may not stably satisfy some high-rate and low-delay scenarios. These technical bottlenecks cause many difficulties for remote-control projects in practical industrial applications, which is why remote control has not yet been widely developed and deployed. The large bandwidth and low delay of the 5G network can remove these bottlenecks, so 5G brings new opportunities for the industrial development of remote-controlled construction machinery.

#### **7.4.2 Problem Statement and Goal**

In a previous study, Bermudez [277] tested the performance of the LTE network by transmitting video data. The article evaluated the behavior of two protocols, Realtime Messaging Protocol (RTMP) and Realtime Streaming Protocol (RTSP), in a 4G environment. Based on their results, I find that the performance of LTE in transferring HD video from the working side to the operator side in realtime is good but not fully satisfying.

In addition, the LTE throughput in their results is still growing steadily, which means that the simulation parameters of Bermudez [277] did not cover critical, extreme working conditions. Whether the LTE network can maintain this performance in a more stringent remote-control situation was not shown. Therefore, a comparison between LTE and 5G for video transmission in construction scenarios is necessary. For remote control, the delay is always a significant indicator because it determines the accuracy and reliability of the job and the safety of the controlled machine [273]. Inspired by this and to fill this research gap, I compare the performance of one of the new 5G technologies, mmWave, with that of the LTE network for construction machinery in remote-control and self-working scenarios. I also subject the simulation to more stringent conditions. To the best of my knowledge, no comparable study exists that investigates whether, and by how much, a 5G network is more suitable than LTE for remote-controlled or self-working construction machinery.

#### **7.4.3 Modelling**

In the scenarios shown in Fig. 7.9 and Fig. 7.10, my UEs, i.e., construction machines, are either observed by HD cameras or carry HD cameras. Here I assume that the construction machines and cameras are both connected with the base station and the operator. The operator sends commands to the construction machines and, at the same time, collects the video streaming data from the cameras. If the cameras are mounted on the machine, the operator issues commands and receives the video data simultaneously. Compared with the operator's commands, video streaming data occupies a much larger bandwidth. Therefore, I use video streaming as the test medium, which allows the performance of both networks to be verified. Video streams with different resolutions occupy different amounts of network bandwidth, so the required resolution determines the load applied to the network.

For the research, I use ns-3 [235, 278] as the simulation tool. For the LTE simulation, I directly use the LTE module in ns-3 because it already provides a complete set of simulation modules and processes for 4G [279]. For the 5G network, in contrast, ns-3 does not yet offer an official simulation platform with all 5G modules, since the technology is still quite new. Fortunately, because ns-3 is an open-source platform, experienced users can contribute to it according to their requirements, for example by rewriting algorithms, adding patches, or making other upgrades. Among these contributions, I selected the model from Mezzavilla [280] to simulate the 5G mmWave performance. The following paragraphs present some basic architecture details and model settings for both network models. Basic parameters are shown in Tab. 7.3 and Tab. 7.4.


Table 7.3: LTE Network Parameters, from 3GPP TS-36101 [281]


Table 7.4: 5G Network Parameters, from 3GPP TS-38101 [254]

#### **7.4.4 Model Parameters**

#### **7.4.4.1 Propagation Model**

For LTE, I use FriisPropagationLossModel [282]. Given an unobstructed visual path between the transmitter and receiver, the free-space propagation model can predict the strength of the received signal. According to Friis [283], the received signal strength can be described as,

$$P_r(d) = \frac{P_t \cdot G_t \cdot G_r \cdot \lambda^2}{(4\pi)^2 \cdot d^2 \cdot L} \tag{7.6}$$

where P<sub>r</sub>(d) is the received signal power, P<sub>t</sub> is the transmit power, G<sub>t</sub> is the transmit antenna gain, G<sub>r</sub> is the receive antenna gain, λ is the wavelength in m, d is the distance between transmitter and receiver, and L is the system loss.
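As a minimal sketch of Eq. (7.6), the following Python snippet evaluates the free-space received power directly from the formula. The function names and all numerical values (1 W transmit power, unit gains, a 2 GHz carrier, distances of 100 m and 200 m) are illustrative assumptions and are not the parameters of Tab. 7.3.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def friis_received_power(p_t_w, g_t, g_r, freq_hz, d_m, system_loss=1.0):
    """Received power in watts according to Eq. (7.6).

    p_t_w: transmit power in W, g_t / g_r: linear antenna gains,
    freq_hz: carrier frequency in Hz, d_m: distance in m,
    system_loss: linear system loss L (1.0 means no loss).
    """
    if d_m <= 0:
        raise ValueError("distance must be positive")
    wavelength = SPEED_OF_LIGHT / freq_hz
    return (p_t_w * g_t * g_r * wavelength ** 2) / (
        (4 * math.pi) ** 2 * d_m ** 2 * system_loss
    )


def watts_to_dbm(p_w):
    """Convert a power in watts to dBm."""
    return 10 * math.log10(p_w * 1000)


# Hypothetical example values, chosen only for illustration.
p_r_100 = friis_received_power(1.0, 1.0, 1.0, 2.0e9, 100.0)
p_r_200 = friis_received_power(1.0, 1.0, 1.0, 2.0e9, 200.0)
print(f"{watts_to_dbm(p_r_100):.2f} dBm at 100 m")
print(f"{watts_to_dbm(p_r_200):.2f} dBm at 200 m")
```

Because the received power scales with 1/d², doubling the distance reduces it by a factor of four, i.e., by about 6 dB. The λ² term likewise shows why, for equal antenna gains, a higher carrier frequency leads to a lower received power at the same distance.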

As for 5G, I use MmWavePropagationLossModel [284]. This mmWave model offers two kinds of path loss models. The first, which I use, is a statistical characterization of the Line of Sight (LOS) state. The other is BuildingsObstaclePropagationLossModel [285], which adds obstacles between the gNB and the UE. Further path-loss models for mmWave can be found in [286].

#### **7.4.4.2 Transmission Control Protocol/Internet Protocol (TCP/IP)**

The network transmission adopts the TCP/IP protocol suite. Its core protocols are the transport layer protocols TCP and User Datagram Protocol (UDP) and the network layer protocol IP, which are usually implemented in the kernel of the operating system. Because the purpose of TCP is reliable data transmission, it relies on a set of mechanisms: connection handshake, acknowledgment of sent data, and retransmission on timeout [287]. For video streaming, the overhead of TCP transmission is too large and impairs image quality and latency. Therefore, UDP is preferred for realtime live streaming [288, 289].
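The connectionless nature of UDP can be illustrated with a short Python sketch that sends a few datagrams over the local loopback interface. It is not part of the ns-3 simulation; the function name `send_frames_over_udp` and the frame contents are hypothetical and serve only to show that no handshake, acknowledgment, or retransmission is involved.

```python
import socket


def send_frames_over_udp(frames, host="127.0.0.1"):
    """Send each frame as one UDP datagram to a local receiver.

    Returns the list of datagrams that actually arrived.
    """
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind((host, 0))  # port 0: let the OS pick a free port
    receiver.settimeout(1.0)  # do not block forever if a datagram is lost
    port = receiver.getsockname()[1]

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    received = []
    try:
        for frame in frames:
            # No connect() and no handshake: the datagram is simply sent.
            sender.sendto(frame, (host, port))
            try:
                data, _addr = receiver.recvfrom(65535)
                received.append(data)
            except socket.timeout:
                # A lost datagram is not retransmitted by UDP.
                pass
    finally:
        sender.close()
        receiver.close()
    return received


frames = [f"frame-{i}".encode() for i in range(3)]
received = send_frames_over_udp(frames)
print(received)
```

For live video, a late frame is usually worthless, so skipping a lost datagram, as in the `except` branch above, is often preferable to waiting for a TCP retransmission.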

#### **7.4.4.3 Hybrid Automatic Repeat Request (HARQ)**

Both 4G and 5G have two levels of retransmission mechanisms: HARQ at the MAC layer and ARQ at the Radio Link Control (RLC) layer [290, 291]. In 4G, the retransmission of lost or erroneous data is mainly handled by the HARQ mechanism of the MAC layer and supplemented by the ARQ of the RLC layer. The HARQ mechanism of the MAC layer provides fast retransmission, and the ARQ mechanism of the RLC layer provides reliable data transmission. In 5G, in contrast, the uplink HARQ mechanism is the same as the downlink one, and both are asynchronous. This brings two changes [292]. First, the scheduling timing is more flexible, especially in TDD mode, resulting in more flexible resource allocation. Second, the pressure on data buffering increases. Unlike LTE's synchronous uplink HARQ, asynchronous HARQ may have a longer retransmission interval. During this time, the UE must buffer the unACKed data, which increases the buffering pressure.
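The buffering argument can be made concrete with a rough estimate: the amount of unACKed data a UE has to hold is approximately the data rate multiplied by the retransmission interval. The sketch below assumes a constant 2 Mbps stream and two hypothetical retransmission intervals (8 ms and 16 ms); these numbers are illustrative assumptions and are not taken from the simulation settings.

```python
def unacked_buffer_bytes(rate_bps, retx_interval_ms):
    """Approximate bytes a UE must buffer while waiting for an ACK.

    rate_bps: constant data rate in bit/s
    retx_interval_ms: time until a retransmission can occur, in ms
    """
    return rate_bps * retx_interval_ms / 8000  # /1000 for ms, /8 for bytes


# Hypothetical intervals, chosen only to show the linear dependence.
buf_short = unacked_buffer_bytes(2_000_000, 8)
buf_long = unacked_buffer_bytes(2_000_000, 16)
print(buf_short, buf_long)
```

Doubling the retransmission interval doubles the required buffer, which is the effect described above for asynchronous HARQ with longer intervals.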

#### **7.4.4.4 Scenarios**

Three scenarios were set up for both network environments. In the first scenario, I choose 2 Mbps as the video streaming rate, which is close to the bandwidth requirement of 720P video streaming [293], and vary the number of UEs from 2 to 20. In the second scenario, I keep the number of UEs constant and vary the data rate from 1 Mbps to 8 Mbps, a range that includes the bandwidth requirements of 720P (3 Mbps), 1080P (5 Mbps), and 3D 1080P (6 Mbps) videos [294]; in this way, I test the network performance for varying video resolutions. In the third scenario, I let the UEs move in order to learn how mobility affects the networks. For scenarios 1 and 2, the mobile machines are observed by the HD cameras, which collect the working video data and transfer it to the operator. The UEs in scenario 3 are cameras installed on mobile machines, so they change position together with the construction machinery while collecting the video stream.

Table 7.5: Network Scenario 1


Table 7.6: Network Scenario 2


Table 7.7: Network Scenario 3


#### **7.4.5 Simulation Results**

This section presents the results of the simulated network scenarios in terms of throughput, packet loss rate, and delay. For both network environments, I ran the simulation repeatedly and averaged the results to improve accuracy.
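Averaging over repeated runs can be sketched as follows. The snippet reports the sample mean together with the standard error of the mean, which indicates how much the average itself would still fluctuate; the delay samples are hypothetical placeholders and not simulation output.

```python
import statistics


def summarize_runs(samples):
    """Return the mean and the standard error of the mean of repeated runs."""
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    mean_value = statistics.mean(samples)
    # Sample standard deviation (n - 1 in the denominator).
    stdev_value = statistics.stdev(samples)
    standard_error = stdev_value / len(samples) ** 0.5
    return mean_value, standard_error


# Hypothetical per-run average delays in ms, for illustration only.
delays_ms = [10.0, 12.0, 11.0, 13.0, 14.0]
mean_delay, sem_delay = summarize_runs(delays_ms)
print(f"mean = {mean_delay:.2f} ms, standard error = {sem_delay:.2f} ms")
```

The standard error shrinks with the square root of the number of runs, so quadrupling the number of repetitions roughly halves the uncertainty of the averaged value.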

Figure 7.11: Network topology. In this figure, the upper left corner is the origin of the coordinates. The side length of each square grid is set as 25 m.

The network topology is shown in Fig. 7.11. Nodes 3 to 12 represent a set of remote devices, i.e., cameras, and the transmitted data represents the video data sent by the camera avatars, which is finally delivered to the user terminal (node 1) through the eNodeB (node 0) or the gNB (node 2), with the Evolved Packet Core (EPC) or NR.

With the increase of the number of UEs, the throughput simulation results are shown in Fig. 7.12(a). In the beginning, the throughput of both the LTE and the 5G network increases rapidly and matches the total data volume, which means both of them can complete the transmission of the video streaming task.

Figure 7.12: Simulation results of scenario 1 and scenario 2.

Figure 7.13: Simulation results of scenario 3.

As the number of UEs increases further, the 5G network can still transmit the video service data; the LTE network, however, cannot provide enough transmission capacity and reaches saturation. Its throughput then remains basically unchanged at about 17 Mbps as the UE number grows. Fig. 7.12(b) shows the simulation results of the packet loss rate as the UE number increases. When the UE number is small, both LTE and 5G networks keep the packet loss rate at a low level, i.e., almost no packet loss occurs. As the UE number increases, the 5G network still maintains a very low packet loss rate, whereas the LTE network suffers more packet loss due to its network resource constraints. It must discard data packets of the video service, causing the transmitted video to lose frames, freeze, or lose the image completely, which seriously affects the operator's control of the construction machinery. In addition, high latency causes the video to be out of sync. In these cases, the operator cannot grasp the on-site working environment in realtime and may make wrong judgments about it, which is very dangerous for the work task and the construction machinery. The average delay of the 5G network is lower than that of the LTE network, as shown in Fig. 7.12(c). This is because the 5G network provides larger network bandwidth, which increases the transmission speed and reduces the packet delay. When the UE number is small, the average delay of the LTE network is about twice that of the 5G network; when the number of UEs is large, the average delay of the LTE network is much higher than that of the 5G network, and the LTE network can no longer guarantee the video streaming service.

In the second simulation scenario, the number of UEs is fixed at 8, and the video service rate is increased from 1 Mbps to 8 Mbps. The throughput as a function of the video service rate is shown in Fig. 7.12(d). At video service rates of 1 Mbps and 2 Mbps, the throughput of both the LTE network and the 5G network meets the requirements of the video streaming services. However, when the video service rate exceeds 3 Mbps, the throughput of the LTE network no longer increases, whereas the throughput of the 5G network keeps increasing with the video service rate and can thus guarantee the transmission of the video service. The simulation results of the packet loss rate are shown in Fig. 7.12(e), where it can be seen that the 5G network maintains a low packet loss rate. For the LTE network, however, severe packet loss occurs at higher video service rates: at 5 Mbps, the packet loss rate of LTE exceeds 50%. The average delay of the video services is presented in Fig. 7.12(f). The delay of the 5G network increases with the data volume but remains low; even at a video service rate of 5 Mbps, the average delay does not exceed 25 ms. The average delay of the LTE network is significantly higher than that of the 5G network. In short, the higher the video service rate, the more significant the improvement with 5G.
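The reported saturation behavior can be cross-checked with a back-of-the-envelope calculation. The sketch below is not the ns-3 model: it assumes an idealized fluid model in which everything above a fixed capacity is dropped, takes the LTE plateau of about 17 Mbps from the discussion of Fig. 7.12(a) as that capacity, and assumes step sizes of 2 UEs and 1 Mbps for the two sweeps.

```python
LTE_SATURATION_MBPS = 17.0  # approximate plateau quoted for Fig. 7.12(a)


def offered_load_mbps(n_ue, rate_mbps):
    """Aggregate offered load when n_ue devices each stream at rate_mbps."""
    return n_ue * rate_mbps


def fluid_loss_rate(offered_mbps, capacity_mbps):
    """Fraction of traffic dropped if everything above capacity is lost."""
    if offered_mbps <= capacity_mbps:
        return 0.0
    return 1.0 - capacity_mbps / offered_mbps


# Scenario 1: 2 Mbps per UE, 2 to 20 UEs (step of 2 is an assumption).
scenario1 = {n: offered_load_mbps(n, 2.0) for n in range(2, 21, 2)}
# Scenario 2: 8 UEs, 1 to 8 Mbps per UE (step of 1 Mbps is an assumption).
scenario2 = {r: offered_load_mbps(8, float(r)) for r in range(1, 9)}

loss_at_2mbps = fluid_loss_rate(scenario2[2], LTE_SATURATION_MBPS)
loss_at_5mbps = fluid_loss_rate(scenario2[5], LTE_SATURATION_MBPS)
print(f"offered load at 5 Mbps: {scenario2[5]} Mbps, loss: {loss_at_5mbps:.3f}")
```

Under these assumptions, 8 UEs at 2 Mbps offer 16 Mbps, which still fits below the plateau, whereas 8 UEs at 5 Mbps offer 40 Mbps and the estimated loss is 57.5%, which is in line with the reported LTE packet loss of more than 50% at that rate.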

In the third scenario, I simulate the case in which construction machines carry the cameras with them as they change position. Here the video service data rate is 2 Mbps, and the number of remote devices is still 8. I simulate distances up to 200 m since the longest propagation distance of mmWave is considered to be 200 m [295]. The simulation results of throughput with increasing speed are shown in Fig. 7.13(a). Owing to its lower frequency bands, the LTE network is affected only slightly by mobility. When the UE velocity is below 40 km/h, the throughput of the 5G network declines in a relatively stable manner. In contrast, when the UE velocity exceeds 40 km/h, the throughput of the 5G network drops dramatically, and the transmission of video services can no longer be guaranteed. Fig. 7.13(b) shows that the packet loss rate of the LTE network rises slowly as the velocity increases, whereas the packet loss rate of the 5G network increases quickly once the UE moves faster than 30 km/h. In Fig. 7.13(c), the delays of both the LTE network and the 5G network increase steadily with velocity. Notably, the delay of the 5G network remains clearly lower than that of LTE.

To sum up, 5G mmWave has significant advantages in terms of throughput, packet loss, and latency if the UEs are fixed. Although one of the requirements for 5G is the capacity to deal with high mobility, mmWave 5G may still have a problem if the beamforming technology, specifically the tracking algorithm, is not perfect. In contrast, since 4G uses a lower frequency band, this problem is less apparent for 4G, which hints at the suitability of sub-6 GHz 5G bands.

## **7.5 Conclusion**

In this chapter, I suggest, based on the analysis of the ad-hoc network and the cellular network, that IEEE 802.11p is a preferable solution for the first version of the fleet management of mobile working machines. Moreover, I propose an analytical model that gives mobile working machines a realtime sense of the packet delay probability, the mean delay, and the probability of packet loss in the ad-hoc network. That is, the machine can estimate in realtime how likely its transmission is to be delayed, how long the delay will be, and how many packets may be lost. Thanks to V2X technology, mobile machines can work closer together and be driven faster so that the productivity of the working site can be increased dramatically.

Afterward, I indicate that 5G can be employed in the construction machinery industry to improve remote-control operation and can serve as an essential component for achieving self-working construction machines. Taking the remote control and self-working of construction machinery as the scenes and using video streaming as the medium, I compared the performance of the LTE network and the 5G mmWave network. Based on my research, I found that 5G can deliver a better quality of live streaming so that both scenes can be significantly improved. In particular, 5G allows more cameras in the same network, indicating the possibility of acquiring depth information from the video. Besides, since it is not difficult to keep the machines within the cameras' field of view, I suggest keeping the cameras stationary to avoid the shortcoming of mmWave. Otherwise, a more robust beamforming algorithm, i.e., dynamic beamforming, is needed.

# **8 Conclusions and Future Directions**

This chapter gives a summary of conclusions that were made throughout this dissertation. Also, I specify the blank and blind spots of the conducted research, and delineate perspective on future directions. To avoid repetition, the conclusions shown in the previous individual chapters will not be shown again.

This thesis has proposed a novel concept of the smart working site, initially focusing on increasing the productivity of working sites. By integrating the AI and IoT technologies developed in this thesis, the safety performance and cost of the working site are expected to improve at the same time. The expected applications are construction sites and mining sites, which currently suffer from low productivity caused by waiting, high risk potential, and a lack of laborers. An individual technology cannot achieve the goal since working sites are complicated and should be optimized as holistic systems.

To provide an alternative, I have presented the fleet management solution. Considering a group of mobile machines as a whole, I showed the blueprint of a future working site using five complementary technologies to make this concept closer to reality: multi-working machines pathfinding algorithm, multi GPS/IMU SLAM system to offer terrain information, working process detection algorithm, visual monitoring system, and wireless communication system.

The validity of the proposed model has been substantiated by comprehensive experiments. Since I expect the smart working site concept to bring rapid change to the industry, the experimental results were obtained with commodity hardware, or the parameters of the validation simulations were set according to the datasheets of affordable sensors.

Besides the contributions listed in Chapter 1, where I pushed forward the SOTA solutions for the individual tasks presented, I believe another main contribution of this thesis is that I quantitatively evaluated and proved the feasibility of the future smart working site considering cutting-edge AIoT technologies.

Regarding future extensions, although I made some contributions, the concept of the smart working site is a comprehensive topic and cannot be completed with only one dissertation. In this section, I point out the shortcomings of our research and give directions for improvements.

Path Planning: As I mentioned in the literature review, path planning is a fast developing and prosperous research field; thus, I did not fully consider all the improved methods for our method's initial version. For instance, I did not add in the mega-agents concept, which merges the agents together based on some specific rules to reduce the conflicts and thus speed up the searching process. Since I confirm that our method can be combined with these methods, I expect the searching process to be accelerated further.

In addition, the Huoshenshan working site proved the concept that the more machines are invested, the faster the project proceeds, which is also consistent with intuition; however, a comprehensive study on the quantitative relationship between the number of machines invested and the productivity of the working site has not yet been done. I encourage experts in civil engineering to propose challenging scenarios and test our algorithm on them.

SLAM: In our research, I show the method to create a map with only one mobile machine. However, in a real working site, many mobile machines work simultaneously on the construction site, indicating the possibility of creating a map even faster if the machines can share the information. Thus, I encourage the researchers to enable the cooperative map drawing approach by means of WiFi or 5G.

Motion Prediction: Although the proposed deep learning algorithm can successfully handle the time series problem to know the machine's working process, a combination with video technology can surely improve the motion prediction accuracy. This combination has not been done in this thesis. Another regret is that I did not spend much time optimizing the CRDNNs to detect truck loading processes due to the limited time. With further optimization, at least the training parameters can be reduced so that an even faster CRDNN can be expected.

Human Machine Communication: The IoT system designed for human-machine communication should be developed further because of its potential. The connected mobile machine will undoubtedly be a research focus in the near future. While Bluetooth is considered a cheap and reliable communication solution for human-machine interaction, I believe the next generation of communication tools should have access to cellular networks (4G or 5G), since other components of the mobile machines, such as the hydraulic pump and hydraulic motor, also need to connect to the communication network for component monitoring, which might overload Bluetooth. Moreover, I believe fleet management can benefit the mobile machine industry. Therefore, in the next generation of the connection system, I will take advantage of 5G to achieve a fully connected working site. Thanks to the cloud, the CRDNN can be trained further with newly gathered data whenever a customer labels a new dataset for newly developed mobile construction machines and thus become even more reliable.

The MOMA Dataset: The task of object detection draws on a wide range of knowledge, experience, and hardware resources. A further in-depth study of mobile machine detection algorithms to improve their precision and fps is highly recommended. The current MOMA dataset is relatively small and only suitable for level-four tasks. To achieve better performance, the size of the dataset should be increased. Besides algorithmic improvements, the dataset itself can be improved as follows. In this dataset, the mobile machines are treated as a whole, whereas it can also make sense to perceive components or subassemblies of mobile machines, for instance, the bucket or backhoe of an excavator. In addition, extra data of mobile machines in extreme poses should be collected if needed. The majority of mobile machines work in typical poses, for instance, an excavator sits on the ground or even in the water, with its bucket moving around; a wheel loader loads coal and unloads it. However, machines must work in extreme poses in some situations, e.g., a dumper depositing earth or a wheel loader buried in the earth but still faintly recognizable by humans. Collecting more images like these may expand the scope of model application. Finally, besides object detection, computer vision is also trending toward image segmentation. Pixel-level semantic segmentation can also improve the detection performance of predictors.

V2X Communication – 5G: Since I use video as the medium to test the performance of the two networks, future work should refine the video factors and explore how different video encoding formats affect the networks. Besides, a simulation analysis of the entire link, starting from the video source, through the networks, and finally to the control operator, can be carried out to extend this work. Moreover, as 6G technology is on the way [296, 297, 298], the possibility of benefiting the construction machinery industry through 6G should be explored.

Clearly, research on the smart working site is only at its beginning.


# **Bibliography**


framework," International Journal of Project Management, vol. 33, no. 6, pp. 1405–1416, 2015.


International Journal of Project Management, vol. 35, no. 4, pp. 686–698, 2017.


International Conference on Computing and Communications Applications and Technologies (I3CAT), Virtual, 2021, pp. 1–8.


online swarm intelligent programming," IEEE Transactions on Industrial Informatics, vol. 14, no. 9, pp. 4149–4158, 2017.


Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 2019, pp. 442–449.


information processing systems, Vancouver, British Columbia, Canada, 2008, pp. 985–992.


# **List of Publications**

## **Journal Articles**

Xiang, Y., K. Liu, T. Su, J. Li, S. Ouyang, S. Mao, and M. Geimer, "An extension of BIM using AI: A multi working-machines pathfinding solution," IEEE Access, vol. 9, pp. 124 583–124 599, 2021.

Xiang, Y., T. Tang, T. Su, C. Brach, L. Liu, S. S. Mao, and M. Geimer, "Fast CRDNN: Towards on site training of mobile construction machines," IEEE Access, vol. 9, pp. 124 253–124 267, 2021.

Xiang, Y., R. Li, C. Brach, X. Liu, and M. Geimer, "A novel algorithm for hydrostatic-mechanical mobile machines with a dual-clutch transmission," Energies, vol. 15, no. 6, 2022. [Online]. Available: https://www.mdpi.com/1996-1073/15/6/2095

Xiang, Y., D. Li, T. Su, Q. Zhou, C. Brach, S. S. Mao, and M. Geimer, "Where am I? SLAM for mobile machines on a smart working site," arXiv, pp. 1–14, 2020. [Online]. Available: arXiv:2011.01830

Xiang, Y., H. Wang, T. Su, R. Li, C. Brach, S. S. Mao, and M. Geimer, "KIT MOMA: A mobile machines dataset," arXiv, 2020. [Online]. Available: arXiv:2007.04198

J. Cai, J. Zhao, Xiang, Y., J. Liu, G. Chen, Y. Hu, and J. Chen, "Can I trust you? Estimation models for e-bikers stop-go decision before amber light at urban intersection," Journal of advanced transportation, vol. 2020, no. 6678996, pp. 1–17, 2020.

Y. Liu, J. Qiao, Y. Hu, T. Fang, T. Xu, Xiang, Y., and Y. Han, "Determination of curve speed zones for mountainous freeways," Mathematical Problems in Engineering, vol. 2020, no. 8844004, pp. 1–11, 2020.

J. Li, W. Hong, and Xiang, Y., "A short review on data modelling for vector fields," arXiv, 2020. [Online]. Available: arXiv:2009.00577

## **Conference Contributions**

Xiang, Y., T. Su, C. Brach, X. Liu, and M. Geimer, "Realtime estimation of IEEE 802.11p for mobile working machines communication respecting delay and packet loss," in Proc. IEEE Intelligent Vehicle Symposium, Las Vegas, USA, 2020, pp. 1516–1521.

Xiang, Y., S. Wang, T. Su, J. Li, S. S. Mao, and M. Geimer, "KIT bus: A shuttle model for CARLA simulator," in Proc. IEEE Industrial Electronics and Applications Conference (IEACon), Virtual, 2021, pp. 1–6.

Xiang, Y., B. Xu, T. Su, C. Brach, S. S. Mao, and M. Geimer, "5G meets construction machines: Towards a smart working site," in Proc. IEEE International Conference on Computing and Communications Applications and Technologies (I3CAT), Virtual, 2021, pp. 1–8.

Xiang, Y. and M. Geimer, "Optimization of operation strategy for primary torque based hydrostatic drivetrain using artificial intelligence," in Proc. 12th International Fluid Power Conference, Dresden, Germany, 2020, pp. 55–65.

Xiang, Y., S. Mutschler, N. Brix, C. Brach, and M. Geimer, "Optimization of hydrostatic-mechanical transmission control strategy by means of torque control," in Proc. 12th International Fluid Power Conference, Dresden, Germany, 2020, pp. 421–431.

L. Liu, B. Li, G. Götting, Xiang, Y., Q. Salem, M. Hamid, and J. Xie, "Loss minimization of traction systems in battery electric vehicles using variable dc-link voltage technique—experimental study," in Proc. 22nd European Conference on Power Electronics and Applications (EPE'20 ECCE Europe), Lyon, France, 2020, pp. P1–P8.

S. Mutschler, N. Brix, and Xiang, Y., "Torque control for mobile machines," in Proc. 11th International Fluid Power Conference, H. Murrenhoff, Ed. Aachen, Germany: RWTH Publications, 2018, pp. 186–195.

J. Yang, F. Lin, Xiang, Y., P. Katranuschkov, and R. Scherer, "Fast crack detection using convolutional neural network," in Proc. 28th International Workshop on Intelligent Computing in Engineering, Berlin, Germany, 2021, pp. 540–549.

## **Supervised Theses**

I accepted all applications from students who were in the final phase of their studies. The selection was not based on previous grades, gender, or nationality. All theses carried out at KIT were supervised by Yusheng Xiang and Marcus Geimer. In addition, the thesis of Mr. Julian Kreis was supervised by Yusheng Xiang and Klaus Allmendinger. The order is based on the date of our on-site interview. I have listed only the theses for whose quality I am responsible.

Li, Ruoyu, "Optimization of control strategy for dual clutch transmission on torque based mobile machines," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2019.

Kreis, Julian, "Optimierung eines Schaltalgorithmus für ein Doppelkupplungsgetriebe in einer Mobilhydraulikapplikation (English title: Optimization of a shift algorithm for a dual clutch transmission in a mobile hydraulic application)," Bachelor thesis, Department of Mechanical Engineering, Technische Hochschule Ulm, Germany, 2020.

Wang, Hongzhe, "Evaluating state-of-the-art object detector on challenging mobile machine dataset," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Tang, Tian, "Working state identification of torque based mobile machines using combined neural networks," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Wang, Long, "Use-case-basierte Lastkollektive Entwicklung für Mildhybridanwendung auf 48V Basis (English title: Use-case-based load collective development for a mild hybrid 48V application)," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Su, Jiamei, "Evaluation of SSD object detector on challenging mobile machines dataset," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Li, Dianzhao, "Where am I? SLAM for mobile machines on a smart working site," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Sun, Bosheng, "An analysis of the sensors arrangement for driverless city buses," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Mei, Tao, "Arrangement of the sensors for driverless city buses in challenging road scenarios," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Cui, Qiangguo, "Estimation of working machines mass and road grade with recursive least square method," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Xu, Bing, "5G meets construction machinery: Towards a smart working site," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Xiong, Xinyi, "A mobile machine with legged system," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Fu, Meiqi, "Driving speed of torque based mobile machines by means of MPC controller," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2020.

Zhao, Hengping, "An approach for multi-construction machinery pathfinding in construction site," Bachelor thesis, Karlsruhe Institute of Technology and Beijing Institute of Technology, Germany and China, 2020.

Zhao, Yang, "Performance evaluation of IEEE 802.11p for mobile working machines communication respecting delay and packet loss," Bachelor thesis, Karlsruhe Institute of Technology and Beijing Institute of Technology, Germany and China, 2020.

Chen, Yucheng, "The advantages of application of CBS in construction sites," Bachelor thesis, Karlsruhe Institute of Technology and Beijing Institute of Technology, Germany and China, 2020.

Gao, Xiaochen, "Design and control of a walking mechanism of an excavator in complex scenarios," Bachelor thesis, Karlsruhe Institute of Technology and Beijing Institute of Technology, Germany and China, 2020.

Zhang, Zhenduo, "Instance segmentation of engineering plants in real construction environment based on Mask-RCNN," Bachelor thesis, Karlsruhe Institute of Technology and Beijing Institute of Technology, Germany and China, 2020.

Huang, Yanliang, "Object recognition on the construction site based on Mask R-CNN," Bachelor thesis, Karlsruhe Institute of Technology and Beijing Institute of Technology, Germany and China, 2020.

Hu, Qiankun, "KIT MOMA V2: Towards instance segmentation of construction machines," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2021.

Wang, Shuo, "A passenger bus model for CARLA simulator and implement simulation," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2021.

Niu, Zhuo, "Research on a rapid recognition method of construction machinery based on the improved YOLO-V4 algorithm," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2021.

Liu, Kailun, "An Extension of BIM Using AI: A Multi Working-Machines Pathfinding Solution," Master thesis, Institute of Vehicle System Technology, Karlsruhe Institute of Technology, Germany, 2021.

### **Karlsruher Schriftenreihe Fahrzeugsystemtechnik FAST Institut für Fahrzeugsystemtechnik (ISSN 1869-6058)**

A complete overview of the volumes is available in the publisher's shop.


The volumes are freely available as PDFs at www.ksp.kit.edu or can be ordered as print editions.



## Karlsruher Schriftenreihe Fahrzeugsystemtechnik

Infrastructure construction is a cornerstone of society and a catalyst for the economy. Improving the efficiency of mobile machinery and reducing its cost of use therefore bring enormous economic benefits in the vast and growing construction market. Instead of focusing on improving the performance of a single construction machine, I consider a group of construction machines as one system in order to improve the productivity of the working site. In this book, I envision a novel concept, the smart working site, which increases productivity through fleet management from multiple aspects, using Artificial Intelligence (AI) and the Internet of Things (IoT).

An investigation of the well-known construction site of the Huoshenshan hospital in Wuhan, where the project was finished at an unprecedented speed during the coronavirus outbreak in 2020, shows that its most distinguishing features were a large investment in machines and well-ordered coordination. Inspired by this particular working site, this book presents approaches for substituting some human coordinators with AI and IoT, thus bringing the concept of a smart working site offering high productivity closer to reality.

ISSN 1869-6058 ISBN 978-3-7315-1165-6